Merge Cutlass 3.6 #169

Merged
merged 56 commits into from
Dec 5, 2024
56 commits
19b4c5e
Fix isnan namespace qualification in cutlass/functional.h (#1679)
mhoemmen Aug 5, 2024
e22ba59
support data type w2 used in cutlass_library (#1517)
gavinchen430 Aug 6, 2024
2049c6c
5476 cutlass 3x gemm kernels (#1695)
depaulmillz Aug 8, 2024
7192f4a
Add CLayout_64x208 (#1680)
tridao Aug 8, 2024
4e5a8f6
3.5.1 plots and updated readme (#1708)
depaulmillz Aug 12, 2024
fb17043
Update half.h (#1709)
eqy Aug 14, 2024
8d8cfdf
update 3.5.1 readme/changelog
hwu36 Aug 15, 2024
865be73
Merge pull request #1713 from NVIDIA/351_sparse_update
d-k-b Aug 15, 2024
b0296bf
fix uint128
hwu36 Aug 16, 2024
3f084f7
Add couple configs into generator.py for mixed input MM (#1350)
alexsamardzic Aug 16, 2024
f93a691
Merge pull request #1714 from NVIDIA/u128_div
d-k-b Aug 16, 2024
4dbf5db
Use CUDA runtime API to retrieve function pointer to driver API (#1700)
shunfan-shao Aug 19, 2024
f7b19de
minor fix for a double quote in CMakeLists.txt (#1727)
Shreya-gaur Aug 20, 2024
e1976da
Add support for mixed 4-bit/8-bit data types GEMM (#1413)
alexsamardzic Aug 30, 2024
6c30441
Update barrier.h (#1782)
Algy Sep 4, 2024
7369adc
Add Sm90LinCombPerColBias (#1774)
ucassjy Sep 4, 2024
06e3377
Remove extraneous comma in declaration (#1776)
saagarjha Sep 5, 2024
82f5075
set_slice3x3 -> set_slice_3x3 (#1784)
lucifer1004 Sep 6, 2024
323c817
Support ComputeFn where output type differs from input type (#1771)
tridao Sep 6, 2024
21d0534
fix assertion (#1790)
seanxwzhang Sep 9, 2024
dbdae51
Support for TMA Epilogue for Group Gemm and add pingpong ptr array & …
Junkai-Wu Sep 11, 2024
3a8c01a
Prefix a member template name with the template keyword. (#1796)
shumway Sep 11, 2024
9f68995
add publication: ‘EVT: Accelerating Deep Learning Training with Epilo…
reed-lau Sep 16, 2024
1ebda1c
Fix MMA promotion interval assertions (#1641)
LyricZhao Sep 16, 2024
2991ce1
Add print_svg for mma (#1733)
reed-lau Sep 18, 2024
44dae8b
Adjust profiler space for SM89 (#1553)
wenlei-bao Sep 19, 2024
e2b0789
Add some can implement rules of hopper convolution. (#1835)
Junkai-Wu Sep 25, 2024
b27c49e
Fix cute doc (#1529)
jiweibo Oct 7, 2024
477a677
Fix typos in test/unit/conv/cache_testbed_output.h (#1652)
alexander-zinoviev Oct 7, 2024
0837a2a
Fix typo in comment (#1787)
sjfeng1999 Oct 7, 2024
cc3c29a
CUTLASS 3.6.0 (#1850)
yzhaiustc Oct 9, 2024
5366879
Handle MNK Sm90{Row, Col}Reduction problem shapes (#1803)
saagarjha Oct 14, 2024
755194a
add is_last_tile
hwu36 Oct 17, 2024
08101d9
Improve sm90 mixed dtype kernel (#1883)
sklevtsov-nvidia Oct 18, 2024
5b50a8f
Add GMMA shape m64n40k16 (#1864)
tridao Oct 22, 2024
d65266a
Add all supported GMMA shapes (#1890)
sklevtsov-nvidia Oct 22, 2024
f3a3bfc
add maximum support (#1833)
Xinyu302 Oct 23, 2024
ea69cc2
fix typo (#1853)
sijialouintel Oct 23, 2024
b0c09ed
fix by adding public (#1753)
Xinyu302 Oct 23, 2024
83ae20c
added mapping for bf16 to torch::kBFloat16 (#1843)
Bogumil-Sapinski-Mobica Oct 23, 2024
e5f3caf
Fix README (#1658)
leimao Oct 23, 2024
03e3bff
Adjusting code indentation (#1639)
103yiran Oct 23, 2024
f02913c
Include of regular_tile_iterator.h fixed for NVRTC (#1765)
MaxAkaAltmer Oct 23, 2024
12626bc
Update gemm_f16n_f16t_f32t_tensor_op_f32_sm80.cu with include "cutlas…
houqi Oct 23, 2024
be692b4
remove redundant hardcoded packing configs in mixed dtype gemm (#1894)
IwakuraRein Oct 23, 2024
a424ca6
fix wrong A/BLayout in MMA_Traits for binary mma and append other MMA…
CalebDu Oct 24, 2024
08a4995
Add a print for the uint{x}b_t type. (#1871)
luliyucoordinate Oct 24, 2024
e8a8b69
Refactor some GroupedGEMM logic (#1899)
azhurkevich Oct 26, 2024
19f5159
feat: support kFactor 8 used in mma tensor op tile iterator (#1512)
gavinchen430 Oct 29, 2024
9004ed2
Update publications (#1912)
wenlei-bao Nov 6, 2024
32e3c38
remove restriction of stride == kernel in nhwc_pooling (#1896)
thorneliu Nov 6, 2024
d656afb
fix undefined in device code error (#1880)
luliyucoordinate Nov 6, 2024
8aa95db
Fix the racing condition of mixed-input gemm when writing the registe…
IwakuraRein Nov 8, 2024
b0e09d7
Fix `cutlass` python library with cuda `12.6.2.post1` (#1942)
danthe3rd Nov 18, 2024
cbea514
Fix issues for Cutlass 3.6
aacostadiaz Dec 5, 2024
bd28827
Merge branch 'sycl-develop' into sync
aacostadiaz Dec 5, 2024
27 changes: 26 additions & 1 deletion CHANGELOG.md
@@ -1,16 +1,41 @@
# NVIDIA CUTLASS Changelog

## [3.6.0](https://github.com/NVIDIA/cutlass/releases/tag/v3.6.0) (2024-10-03)

- [Hopper structured sparse GEMM](./examples/62_hopper_sparse_gemm/62_hopper_sparse_gemm.cu).
+ [FP16](./test/unit/gemm/device/sm90_sparse_gemm_f16_f16_f32_tensor_op_f32.cu)
+ [FP8](./test/unit/gemm/device/sm90_sparse_gemm_f8_f8_f32_tensor_op_f32.cu)
+ [INT8](./test/unit/gemm/device/sm90_sparse_gemm_s8_s8_s32_tensor_op_s32.cu)
+ [TF32](./test/unit/gemm/device/sm90_sparse_gemm_tf32_tf32_f32_tensor_op_f32.cu)
- A refactor to the CUTLASS 3.x convolution `kernel::ConvUniversal` [API](./include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp) to bring it in line with `gemm::GemmUniversal`. Now the 3.x convolution API is no longer considered as a beta API.
- [An improved mixed input GEMM](./examples/55_hopper_mixed_dtype_gemm/README.md) and a [lookup table implementation](./examples/55_hopper_mixed_dtype_gemm/55_hopper_int4_fp8_gemm.cu) for `INT4`x`FP8` scale-only mode.
- [EVT nodes for Top-K selection and softmax](./include/cutlass/epilogue/fusion/sm90_visitor_topk_softmax.hpp) and [GEMM example using those](./examples/61_hopper_gemm_with_topk_and_softmax/61_hopper_gemm_with_topk_and_softmax.cu).
- [Programmatic Dependent Launch](./include/cutlass/arch/grid_dependency_control.h) (PDL) that leverages a new Hopper feature to speed up two back-to-back kernels, and its corresponding [documentation](./media/docs/dependent_kernel_launch.md).
- [A new debugging tool, synclog](./include/cutlass/arch/synclog.hpp), for dumping out all synchronization events from within a kernel to a file. Please see [synclog documentation](./media/docs/utilities.md#debugging-asynchronous-kernels-with-cutlasss-built-in-synclog-tool) for details.
- A new TMA-enabled [epilogue](./include/cutlass/epilogue/collective/sm90_epilogue_array_tma_warpspecialized.hpp) for grouped GEMM that brings significant performance improvement, as well as its EVT support.
- A SIMT-enabled pointer-array [epilogue](./include/cutlass/epilogue/collective/sm70_epilogue_vectorized_array.hpp).
- A new [Ping-Pong kernel schedule for Grouped GEMM](./include/cutlass/gemm/kernel/sm90_gemm_array_tma_warpspecialized_pingpong.hpp) and some other optimizations.
- [A new instantiation strategy for CUTLASS profiler kernels](./python/cutlass_library/sm90_shapes.py) along with [improved documentation for instantiation level in CUTLASS profiler](./media/docs/profiler.md#instantiating-more-kernels-with-hopper).
- New hardware support for comparisons and computations of [`cutlass::bfloat16_t`](./include/cutlass/bfloat16.h).
- Fixed use of isnan on Windows for [`half_t`](./test/unit/core/functional.cu).
- Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs!

## [3.5.1](https://github.com/NVIDIA/cutlass/releases/tag/v3.5.1) (2024-07-25)

- [Minimal SM90 WGMMA + TMA GEMM example in 100 lines of code](./examples/cute/tutorial/wgmma_sm90.cu)
- [Exposure of L2 `cache_hint`s in TMA copy atoms](./include/cute/arch/copy_sm90_tma.hpp#L48)
- Exposure of raster order and tile swizzle extent in [CUTLASS library profiler](./media/docs/profiler.md#GEMM), and
[example 48](./examples/48_hopper_warp_specialized_gemm/48_hopper_warp_specialized_gemm.cu).
- [TMA store based and EVT supported epilogues](./include/cutlass/epilogue/collective/sm90_epilogue_array_tma_warpspecialized.hpp) for [Hopper pointer array batched kernels](./test/unit/gemm/device/sm90_gemm_f16_f16_f16_tensor_op_f32_ptr_array.cu).
- A new [`GemmSparseUniversal` API for CUTLASS 2.x Ampere kernels](./include/cutlass/gemm/device/gemm_sparse_universal.h) leveraging 2:4 structured sparsity and [support for LLM friendly tile sizes](./test/unit/gemm/device/gemm_f16n_f16t_f32t_tensor_op_f32_sparse_sm80.cu).
- A new [`GemmSparseUniversal` API for CUTLASS 2.x Ampere kernels](./include/cutlass/gemm/device/gemm_sparse_universal.h) to enable serial and parallel split-k for sparse tensor cores and new tiny tile sizes to better support LLM inference:
+ [FP16 TN](./test/unit/gemm/device/gemm_f16t_f16n_f32t_tensor_op_f32_sparse_sm80.cu#L269-L393) and [NT](./test/unit/gemm/device/gemm_f16n_f16t_f32t_tensor_op_f32_sparse_sm80.cu#L269-L411).
+ [int8 TN](./test/unit/gemm/device/gemm_s8t_s8n_s32t_tensor_op_s32_sparse_sm80.cu#L264-L452).
+ [int4 TN](./test/unit/gemm/device/gemm_s4t_s4n_s32t_tensor_op_s32_sparse_sm80.cu#L264-L452).
+ [FP32 TN](./test/unit/gemm/device/gemm_f32t_f32n_f32t_tensor_op_f32_sparse_sm80.cu#L427-L642) and [NT](./test/unit/gemm/device/gemm_f32n_f32t_f32t_tensor_op_f32_sparse_sm80.cu#L427-L456).
- [CUDA host adapter](./include/cutlass/cuda_host_adapter.hpp) extensions to support TMA descriptor construction driver APIs.
- Inclusion of more [Hopper fprop, dgrad, and wgrad convolution kernels in CUTLASS library and profiler](./python/cutlass_library/generator.py).
- Support for residual add (beta != 0) in convolution kernels.
- A new convolution [epilogue](./examples/16_ampere_tensorop_conv2dfprop/ampere_tensorop_conv2dfprop.cu#L269) for CUTLASS 2.x to support non-packed NHWC output.
- A refactor of [include files throughout CUTLASS core directories](./include/cutlass/gemm/collective/collective_mma_decl.hpp) to reduce circular dependencies and [tests to guard against them](./test/self_contained_includes/CMakeLists.txt).
- [A guide for setting up VSCode to work well with CUTLASS](./media/docs/ide_setup.md) and [expanded code style guide](./media/docs/programming_guidelines.md).
- Better support for MSVC as a host compiler.
65 changes: 42 additions & 23 deletions CMakeLists.txt
@@ -177,7 +177,6 @@ set(CUTLASS_ENABLE_BENCHMARKS ON CACHE BOOL "Enable CUTLASS Benchmarks")
 set(CUTLASS_ENABLE_TESTS ${CUTLASS_ENABLE_TESTS_INIT} CACHE BOOL "Enable CUTLASS Tests")
 set(CUTLASS_ENABLE_GTEST_UNIT_TESTS ${CUTLASS_ENABLE_TESTS} CACHE BOOL "Enable CUTLASS GTest-based Unit Tests")
 set(CUTLASS_USE_SYSTEM_GOOGLETEST OFF CACHE BOOL "Use system/external installation of GTest")
-
 set(CUTLASS_USE_PACKED_TUPLE ON CACHE BOOL "If ON, make cute::tuple be new standard-layout tuple type; if OFF, use the original cute::tuple implementation that is _not_ standard-layout.")
 if (CUTLASS_USE_PACKED_TUPLE)
   list(APPEND CUTLASS_CUDA_NVCC_FLAGS -DCUTE_USE_PACKED_TUPLE=1)
@@ -315,6 +314,7 @@ set(CUTLASS_LIBRARY_OPERATIONS "all" CACHE STRING "Comma-delimited list of opera
 set(CUTLASS_LIBRARY_KERNELS ${CUTLASS_LIBRARY_KERNELS_INIT} CACHE STRING "Comma-delimited list of kernel name filters. If unspecified, only the largest tile size is enabled. If the string 'all' is specified, all kernels are enabled.")
 set(CUTLASS_LIBRARY_IGNORE_KERNELS "" CACHE STRING "Comma-delimited list of kernels to exclude from build. This option ONLY takes effect if CUTLASS_LIBRARY_KERNELS is set.")
 set(CUTLASS_LIBRARY_EXCLUDE_KERNELS "" CACHE STRING "Comma-delimited list of kernels to exclude from build. This option always takes effect, whether or not CUTLASS_LIBRARY_KERNELS is set. It also can exclude kernels from the filter file (see KERNEL_FILTER_FILE).")
+set(CUTLASS_LIBRARY_INSTANTIATION_LEVEL "" CACHE STRING "Instantiation level for SM90 kernels. Set to `max` and make sure CUTLASS_LIBRARY_KERNELS is non-empty to stamp all possible kernel configurations.")

 ################################################################################

@@ -362,6 +362,8 @@ if(CUTLASS_ENABLE_SM90_EXTENDED_MMA_SHAPES)
   list(APPEND CUTLASS_CUDA_NVCC_FLAGS -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
 endif()

+set(CUTLASS_SKIP_REDUCTION_INIT OFF CACHE BOOL "Disable init reduction workspace")
+
 #
 # NOTE: running with asan and CUDA requires the following environment variable:
 #
@@ -389,6 +391,10 @@ if(CUTLASS_NVCC_EMBED_PTX)
   list(APPEND CUTLASS_CUDA_CLANG_FLAGS --cuda-include-ptx=all)
 endif()

+if (CUTLASS_SKIP_REDUCTION_INIT)
+  list(APPEND CUTLASS_CUDA_NVCC_FLAGS -DCUTLASS_SKIP_REDUCTION_INIT=1)
+endif()
+
 if (CUTLASS_ENABLE_TENSOR_CORE_MMA)
   list(APPEND CUTLASS_CUDA_FLAGS -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1)
 endif()
@@ -398,6 +404,18 @@ if (CUTLASS_PROFILER_DISABLE_REFERENCE)
   list(APPEND CUTLASS_CUDA_NVCC_FLAGS -DCUTLASS_PROFILER_DISABLE_REFERENCE=1)
 endif()

+if (CUTLASS_ENABLE_GDC_FOR_SM90)
+  message(STATUS "Grid Dependency Control (GDC) is enabled for SM90 kernels (required for programmatic dependent launches).")
+  list(APPEND CUTLASS_CUDA_NVCC_FLAGS -DCUTLASS_ENABLE_GDC_FOR_SM90=1)
+endif()
+
+set(CUTLASS_ENABLE_SYNCLOG OFF CACHE BOOL "Enable synchronization event logging for race condition debugging. WARNING: This redefines __syncthreads() and __syncwarp() in all downstream code!")
+
+if (CUTLASS_ENABLE_SYNCLOG)
+  set(CMAKE_CUDA_SEPARABLE_COMPILATION ON)
+  string(APPEND CMAKE_CXX_FLAGS " -DCUTLASS_ENABLE_SYNCLOG=1")
+  string(APPEND CMAKE_CUDA_FLAGS " -DCUTLASS_ENABLE_SYNCLOG=1")
+endif()
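
Reviewer note: the cache options touched in the hunks up to this point can be pre-set from an initial-cache script instead of individual `-D` flags. The following is a minimal sketch, not part of the PR. The file name `cutlass_options.cmake` and the kernel filter value are assumptions made for illustration, and the declaration of `CUTLASS_ENABLE_GDC_FOR_SM90` is not visible in this diff, so it is simply set as a cache boolean here. Only the variable names come from the diff.

```cmake
# cutlass_options.cmake (hypothetical file name). Load it with CMake's
# initial-cache option:
#   cmake -C cutlass_options.cmake <path-to-source>

# Turn on synchronization event logging. Per the option's help text in the
# diff, this redefines __syncthreads() and __syncwarp() in all downstream
# code, so it is probably best kept to debugging builds.
set(CUTLASS_ENABLE_SYNCLOG ON CACHE BOOL "Enable synclog")

# Makes the CMakeLists.txt branch above append -DCUTLASS_ENABLE_GDC_FOR_SM90=1
# to the NVCC flags (described there as required for programmatic dependent
# launches).
set(CUTLASS_ENABLE_GDC_FOR_SM90 ON CACHE BOOL "Enable GDC for SM90")

# Request the `max` instantiation level for SM90 kernels. The help text asks
# for a non-empty CUTLASS_LIBRARY_KERNELS; the filter string below is an
# assumed example, not a recommendation.
set(CUTLASS_LIBRARY_KERNELS "gemm" CACHE STRING "Kernel name filters")
set(CUTLASS_LIBRARY_INSTANTIATION_LEVEL "max" CACHE STRING "SM90 instantiation level")
```

The `set(... CACHE ...)` calls shown in the diff do not use `FORCE`, so values pre-loaded this way should survive the project's own defaults.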



@@ -926,12 +944,27 @@ function(cutlass_add_executable_tests NAME TARGET)

   set(TEST_GROUP_NAME ${NAME})

+  # To run the tests from an install package with tests enabled, we need to generate test files
+  # that don't rely on the current directory structure in build.
+
+  set(TEST_NAME c${NAME})
+  set(TEST_GEN_DIR ${CMAKE_CURRENT_BINARY_DIR}/ctest/${TEST_NAME})
+  file(MAKE_DIRECTORY ${TEST_GEN_DIR})
+
+  set(TEST_EXE_PATH $<TARGET_FILE:${TARGET}>)
+  set(TEST_USE_EXTENDED_FORMAT ON)
+  configure_file("${CUTLASS_CTEST_TEMPLATE_FILE}" "${TEST_GEN_DIR}/CTestTestfile.${TEST_NAME}.cmake" @ONLY)
+
+  set(TEST_EXE_PATH $<TARGET_FILE_NAME:${TARGET}>)
+  set(TEST_USE_EXTENDED_FORMAT OFF) # ctest does not support extended add_test format.
+  configure_file("${CUTLASS_CTEST_TEMPLATE_FILE}" "${TEST_GEN_DIR}/CTestTestfile.${TEST_NAME}.install.cmake.in" @ONLY)
+
   foreach(CMD_OPTIONS_VAR IN LISTS __TEST_COMMAND_OPTIONS)

     if (CMD_COUNT GREATER 1)
-      string(TOLOWER "${NAME}_${CMD_OPTIONS_VAR}" TEST_NAME)
+      string(TOLOWER "${NAME}_${CMD_OPTIONS_VAR}" TESTCASE_NAME)
     else()
-      string(TOLOWER "${NAME}" TEST_NAME)
+      string(TOLOWER "${NAME}" TESTCASE_NAME)
     endif()

     # The following rigmarole is needed to deal with spaces and possible quotes in
@@ -945,42 +978,28 @@ function(cutlass_add_executable_tests NAME TARGET)
     separate_arguments(TEST_COMMAND_OPTIONS)

     add_custom_target(
-      ${TEST_NAME}
+      ${TESTCASE_NAME}
       COMMAND
       ${CUTLASS_TEST_EXECUTION_ENVIRONMENT} $<TARGET_FILE:${TARGET}> ${TEST_COMMAND_OPTIONS}
       DEPENDS
       ${TARGET}
     )

     if (CMD_COUNT GREATER 1)
-      add_dependencies(${NAME} ${TEST_NAME})
+      add_dependencies(${NAME} ${TESTCASE_NAME})
     endif()

     foreach(DEPENDEE ${__DEPENDEES})
-      add_dependencies(${DEPENDEE} ${TEST_NAME})
+      add_dependencies(${DEPENDEE} ${TESTCASE_NAME})
     endforeach()

-    set(TEST_NAME c${TEST_NAME})
+    set(TESTCASE_NAME c${TESTCASE_NAME})
     string(CONFIGURE "${_INLINE_PER_TEST_CODE_TEMPLATE}" _TEST_CODE @ONLY)
-    string(APPEND _INLINE_PER_TEST_CODE "${_TEST_CODE}")
+    file(APPEND "${TEST_GEN_DIR}/CTestTestfile.${TEST_NAME}.cmake" "${_TEST_CODE}")
+    file(APPEND "${TEST_GEN_DIR}/CTestTestfile.${TEST_NAME}.install.cmake.in" "${_TEST_CODE}")

   endforeach()

-  # To run the tests from an install package with tests enabled, we need to generate test files
-  # that don't rely on the current directory structure in build.
-
-  set(TEST_NAME c${NAME})
-  set(TEST_GEN_DIR ${CMAKE_CURRENT_BINARY_DIR}/ctest/${TEST_NAME})
-  file(MAKE_DIRECTORY ${TEST_GEN_DIR})
-
-  set(TEST_EXE_PATH $<TARGET_FILE:${TARGET}>)
-  set(TEST_USE_EXTENDED_FORMAT ON)
-  configure_file("${CUTLASS_CTEST_TEMPLATE_FILE}" "${TEST_GEN_DIR}/CTestTestfile.${TEST_NAME}.cmake" @ONLY)
-
-  set(TEST_EXE_PATH $<TARGET_FILE_NAME:${TARGET}>)
-  set(TEST_USE_EXTENDED_FORMAT OFF) # ctest does not support extended add_test format.
-  configure_file("${CUTLASS_CTEST_TEMPLATE_FILE}" "${TEST_GEN_DIR}/CTestTestfile.${TEST_NAME}.install.cmake.in" @ONLY)
-
   # The following line imports the tests for immediate run via `make test`.

   include(${TEST_GEN_DIR}/CTestTestfile.${TEST_NAME}.cmake)
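
Reviewer note: the two hunks above appear to separate the per-executable `TEST_NAME`, which keys the generated `CTestTestfile` paths, from a per-command `TESTCASE_NAME`, which names each custom target and test entry. The standalone script below is a minimal sketch of that naming logic only. The `NAME` value and option-variable names are made up for illustration, and `CMD_COUNT` is assumed to be the length of the option list, since its computation is not shown in the diff.

```cmake
# testcase_name_demo.cmake (hypothetical file name). Run in script mode:
#   cmake -P testcase_name_demo.cmake

# Illustrative inputs; in the real function these come from the caller.
set(NAME "Test_Unit_Gemm")
set(__TEST_COMMAND_OPTIONS OPTS_A OPTS_B)

# Assumption: CMD_COUNT is the number of command-option variables.
list(LENGTH __TEST_COMMAND_OPTIONS CMD_COUNT)

# Fixed once per executable; after this PR it is no longer overwritten
# inside the loop, so the generated file paths stay stable.
set(TEST_NAME c${NAME})

foreach(CMD_OPTIONS_VAR IN LISTS __TEST_COMMAND_OPTIONS)
  # Per-command name: lowercased, with the option variable appended only
  # when there is more than one command line to run.
  if (CMD_COUNT GREATER 1)
    string(TOLOWER "${NAME}_${CMD_OPTIONS_VAR}" TESTCASE_NAME)
  else()
    string(TOLOWER "${NAME}" TESTCASE_NAME)
  endif()

  # The test entry gets the same "c" prefix the diff applies.
  set(TESTCASE_NAME c${TESTCASE_NAME})

  message(STATUS "file key: ${TEST_NAME}  test case: ${TESTCASE_NAME}")
endforeach()
```

Keeping the file key constant across iterations is what lets the `file(APPEND ...)` calls in the loop write every test case into the same pair of generated files.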
8 changes: 8 additions & 0 deletions PUBLICATIONS.md
@@ -2,6 +2,12 @@

## 2024

- ["ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference"](https://arxiv.org/abs/2410.21465). Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen. _arXiv_, October 2024.

- ["FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion"](https://arxiv.org/abs/2406.06858). Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Chengji Yao, Ziheng Jiang, Haibin Lin, Xin Jin, Xin Liu. _arXiv_, June 2024.

- ["EVT: Accelerating Deep Learning Training with Epilogue Visitor Tree"](https://dl.acm.org/doi/10.1145/3620666.3651369). Zhaodong Chen, Andrew Kerr, Richard Cai, Jack Kosaian, Haicheng Wu, Yufei Ding, and Yuan Xie. _Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems_, April 2024.

- ["Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock Level"](https://arxiv.org/abs/2403.04690). Ali Hassani, Wen-Mei Hwu, Humphrey Shi. _arXiv_, March 2024.

## 2023
@@ -24,6 +30,8 @@

- ["Mixed Precision Post Training Quantization of Neural Networks with Sensitivity Guided Search"](https://arxiv.org/abs/2302.01382). Clemens JS Schaefer, Elfie Guo, Caitlin Stanton, Xiaofan Zhang, Tom Jablin, Navid Lambert-Shirzad, Jian Li, Chiachen Chou, Siddharth Joshi, Yu Emma Wang. _arXiv_, February 2023.

- ["Dynamic N:M Fine-Grained Structured Sparse Attention Mechanism"](https://dl.acm.org/doi/abs/10.1145/3572848.3577500). Zhaodong Chen, Zheng Qu, Yuying Quan, Liu Liu, Yufei Ding, Yuan Xie. _Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming_, February 2023.

- ["Stream-K: Work-centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU"](https://arxiv.org/abs/2301.03598). Muhammad Osama, Duane Merrill, Cris Cecka, Michael Garland, John D. Owens. _arXiv_, January 2023.

## 2022