Merge Cutlass version 3.6 #162

aacostadiaz · 2024-11-29T11:50:13Z

This PR merges the changes from CUTLASS version 3.6.

* Fix unrelated MSVC build warnings
* Fix use of isnan in functional.h

Correct namespace qualification of isnan in functional.h so that it invokes cutlass::isnan for half_t, instead of converting half_t to float and invoking std::isnan (on host, or ::isnan on device).

* Add couple configs into generator.py for mixed input MM
* change one unit test name; reenable 128x32 in the profiler
* Added U8/BF16 tests.

---------
Co-authored-by: Haicheng Wu <[email protected]>
Co-authored-by: Haicheng Wu <[email protected]>

…_Traits support (NVIDIA#1856)
* fix wrong A/BLayout in MMA_Traits<SM80_16x8x256_S32U1U1S32_TN_XORPOPC> and append support for m8n8k128, m16n8k128 mma.and.popc in MMA_Traits instantiation
* add "print" template for subbyte_reference<T>

* Fix `cutlass` python library with cuda `12.6.2.post1`

  Previously we had this error:
  ```
  File "/storage/home/cutlass/python/cutlass/backend/operation.py", line 39, in <listcomp>
    _version_splits = [int(x) for x in __version__.split("rc")[0].split(".")]
    ^^^^^^
  ValueError: invalid literal for int() with base 10: 'post1'
  ```
* Update sm90_utils.py
* Update generator.py
* Update python/cutlass_library/generator.py
  Co-authored-by: Jack Kosaian <[email protected]>
* Update python/cutlass_library/sm90_utils.py
  Co-authored-by: Jack Kosaian <[email protected]>

---------
Co-authored-by: Jack Kosaian <[email protected]>
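The `ValueError` in the traceback above comes from calling `int()` on the `post1` component of the version string. The following is a minimal sketch of one way to tolerate such suffixes. It is not the change made in the commit itself, and the function name `leading_version_numbers` is hypothetical, introduced here only for illustration.

```python
import re


def leading_version_numbers(version: str) -> list[int]:
    """Return the leading dot-separated integer components of a version string.

    Hypothetical helper (not part of cutlass): parsing stops at the first
    component that is not purely numeric, such as "post1".
    """
    numbers = []
    # Drop any release-candidate suffix first, as the original expression did.
    for part in version.split("rc")[0].split("."):
        if re.fullmatch(r"[0-9]+", part) is None:
            break  # non-numeric component, e.g. "post1"
        numbers.append(int(part))
    return numbers


if __name__ == "__main__":
    print(leading_version_numbers("12.6.2.post1"))
```

Under these assumptions, `12.6.2.post1` is reduced to its numeric prefix `[12, 6, 2]` instead of raising, while plain versions such as `12.6` parse as before.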

# Conflicts:
#   examples/CMakeLists.txt
#   include/cute/arch/copy_sm90_desc.hpp
#   include/cute/arch/util.hpp
#   include/cute/atom/mma_traits.hpp
#   include/cute/numeric/numeric_types.hpp
#   include/cutlass/arch/barrier.h
#   include/cutlass/epilogue/collective/collective_epilogue.hpp
#   include/cutlass/gemm/collective/collective_builder.hpp
#   include/cutlass/gemm/device/gemm.h
#   include/cutlass/gemm/device/gemm_universal_adapter.h
#   include/cutlass/gemm/kernel/sm90_gemm_array_tma_warpspecialized_cooperative.hpp
#   include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized.hpp
#   include/cutlass/gemm/kernel/sm90_tile_scheduler_stream_k.hpp
#   include/cutlass/platform/platform.h
#   tools/library/CMakeLists.txt
#   tools/util/include/cutlass/util/device_memory.h

joeatodd left a comment

LGTM - couple of small things

include/cutlass/numeric_conversion.h

include/cutlass/kernel_launch.h

mhoemmen and others added 30 commits August 5, 2024 14:28

support data type w2 used in cutlass_library (NVIDIA#1517)

e22ba59

5476 cutlass 3x gemm kernels (NVIDIA#1695)

2049c6c

Co-authored-by: dePaul Miller <[email protected]>

Add CLayout_64x208 (NVIDIA#1680)

7192f4a

Without this, I get a compilation error when the extended shapes are enabled.

3.5.1 plots and updated readme (NVIDIA#1708)

4e5a8f6

Co-authored-by: dePaul Miller <[email protected]>

Update half.h (NVIDIA#1709)

fb17043

update 3.5.1 readme/changelog

8d8cfdf

Merge pull request NVIDIA#1713 from NVIDIA/351_sparse_update

865be73

update 3.5.1 readme/changelog

fix uint128

b0296bf

Merge pull request NVIDIA#1714 from NVIDIA/u128_div

f93a691

fix uint128

Use CUDA runtime API to retrieve function pointer to driver API (NVID…

4dbf5db

…IA#1700)
* Query pfn to driver api
* use default for older toolkits

---------
Co-authored-by: shunfans <[email protected]>

minor fix for a double quote in CMakeLists.txt (NVIDIA#1727)

f7b19de

Add support for mixed 4-bit/8-bit data types GEMM (NVIDIA#1413)

e1976da

* Add support for mixed 4-bit/8-bit data types GEMM
* fix ( and )

---------
Co-authored-by: Aleksandar Samardžić <[email protected]>
Co-authored-by: Haicheng Wu <[email protected]>

Update barrier.h (NVIDIA#1782)

6c30441

Add Sm90LinCombPerColBias (NVIDIA#1774)

7369adc

Co-authored-by: Jiayu Sun <[email protected]>

Remove extraneous comma in declaration (NVIDIA#1776)

06e3377

set_slice3x3 -> set_slice_3x3 (NVIDIA#1784)

82f5075

Support ComputeFn where output type differs from input type (NVIDIA#1771

323c817

) This is useful, e.g., for a function that takes two float inputs and turns them into a complex value.

fix assertion (NVIDIA#1790)

21d0534

Support for TMA Epilogue for Group Gemm and add pingpong ptr array & …

dbdae51

…Group Gemm (NVIDIA#1795)

Prefix a member template name with the template keyword. (NVIDIA#1796)

3a8c01a

Fixes LLVM build error.

add publication: ‘EVT: Accelerating Deep Learning Training with Epilo…

9f68995

…gue Visitor Tree’ (NVIDIA#1526)
Co-authored-by: Haicheng Wu <[email protected]>

Fix MMA promotion interval assertions (NVIDIA#1641)

1ebda1c

Add print_svg for mma (NVIDIA#1733)

2991ce1

* add print_svg for mma
* correct the code indentation

Adjust profiler space for SM89 (NVIDIA#1553)

44dae8b

Add some can implement rules of hopper convolution. (NVIDIA#1835)

e2b0789

Fix cute doc (NVIDIA#1529)

b27c49e

Fix typos in test/unit/conv/cache_testbed_output.h (NVIDIA#1652)

477a677

Co-authored-by: Alexander Zinoviev <[email protected]>

Fix typo in comment (NVIDIA#1787)

0837a2a

MaxAkaAltmer and others added 14 commits October 23, 2024 12:55

Include of regular_tile_iterator.h fixed for NVRTC (NVIDIA#1765)

f02913c

* Include of regular_tile_iterator.h fixed for NVRTC
* More include fixed for NVRTC

Update gemm_f16n_f16t_f32t_tensor_op_f32_sm80.cu with include "cutlas…

12626bc

…s/gemm/device/gemm_universal.h" (NVIDIA#1569)
fix compile with `cmake .. -DCUTLASS_ENABLE_TESTS=ON -DCUTLASS_TEST_LEVEL=2`

remove redundant hardcoded packing configs in mixed dtype gemm (NVIDI…

be692b4

…A#1894)
Co-authored-by: Siyuan Fu <[email protected]>

Add a print for the uint{x}b_t type. (NVIDIA#1871)

08a4995

Refactor some GroupedGEMM logic (NVIDIA#1899)

e8a8b69

feat: support kFactor 8 used in mma tensor op tile iterator (NVIDIA#1512

19f5159

)

Update publications (NVIDIA#1912)

9004ed2

remove restriction of stride == kernel in nhwc_pooling (NVIDIA#1896)

32e3c38

fix undefined in device code error (NVIDIA#1880)

d656afb

Fix the racing condition of mixed-input gemm when writing the registe…

8aa95db

…rs (NVIDIA#1931)
* move two warpgroup_wait
* merge main

---------
Co-authored-by: Siyuan Fu <[email protected]>

Solve issues

a119855

aacostadiaz force-pushed the aacosta/3.6 branch from 6fa4521 to a119855 on December 3, 2024 10:43

mehdi-goli approved these changes Dec 3, 2024

t4c1 approved these changes Dec 4, 2024

aacostadiaz and others added 6 commits December 4, 2024 12:02

Merge branch 'sycl-develop' into aacosta/3.6

2333de6

# Conflicts:
#   test/unit/gemm/device/gemm_testbed_3x.hpp

Add dAlpha & dBeta to epilogue callback args

9ffe59b

Guard CUDA stuff

4a4b9a9

Fix IsLegacyEpiloguePolicy & rename VectorBeta->VectorScale

684a5a8

Pass EpiloguePolicy, not Epilogue

fd197d4

Merge branch 'sycl-develop' into aacosta/3.6

e322028

joeatodd approved these changes Dec 5, 2024

include/cutlass/numeric_conversion.h (outdated, resolved)

include/cutlass/kernel_launch.h (resolved)

joeatodd approved these changes Dec 5, 2024

Use syclcompat::dp4a

ce8d683

aacostadiaz force-pushed the aacosta/3.6 branch from 174c0b9 to ce8d683 on December 5, 2024 13:12

mehdi-goli approved these changes Dec 5, 2024

aacostadiaz merged commit 36c455f into codeplaysoftware:sycl-develop Dec 5, 2024
5 checks passed

aacostadiaz mentioned this pull request Dec 5, 2024

Merge Cutlass 3.6 #169

Merged
