Skip to content

Conversation

@JH-Leon-KIM-AMD
Copy link
Contributor

@JH-Leon-KIM-AMD JH-Leon-KIM-AMD commented Nov 4, 2025

Proposed changes

This PR adds factory support for the remaining forward convolution device operations in CK Builder.

Jira Ticket: https://amd-hub.atlassian.net/jira/software/c/projects/ALMIOPEN/boards/319/backlog?selectedIssue=ALMIOPEN-350

Task 350 - Add remaining forward convolution device operations:

  1. DeviceGroupedConvFwdDlMultipleD_NHWC_KYXC_NHWK

    • Added DL (Direct Load) factory specialization for NHWC layout-specific convolution
    • Created new DL-specific algorithm descriptor with 30 template parameters
    • Added test helper function and 3 test cases covering DEFAULT and FILTER_1X1_PAD0 specializations
    • All tests passing (15/15 builder tests)
  2. DeviceGroupedConvFwdMultipleD_Xdl_CShuffle_Large_Tensor

    • Added Large_Tensor factory specialization for N-dimension splitting (large-than-memory tensors)
    • Implemented macro collision workaround using pragma push/pop for GridwiseGemmTemplateParameters
    • Reuses existing XDL algorithm descriptor (42 identical template parameters)
    • Added test helper function and 2 test cases covering DEFAULT and FILTER_1X1_PAD0 specializations
    • All tests passing (15/15 builder tests)

This completes Task 350 - all 4 forward convolution device operations are now supported in CK Builder:

  • ✅ DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle (by Ville)
  • ✅ DeviceGroupedConvFwdMultipleD_Wmma_CShuffle (by Ville)
  • ✅ DeviceGroupedConvFwdDlMultipleD_NHWC_KYXC_NHWK (this PR)
  • ✅ DeviceGroupedConvFwdMultipleD_Xdl_CShuffle_Large_Tensor (this PR)

Checklist

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run
  • I have added inline documentation which enables the maintainers with understanding the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

Design Decisions

1. DL Algorithm Descriptor (30 parameters)

  • DL uses VALU instructions instead of XDL matrix cores, requiring different parameter structure
  • Fixed NHWC_KYXC_NHWK layout only (not flexible like XDL)
  • Created separate descriptor type: ConvAlgorithm_DeviceGroupedConvFwdDlMultipleD_NHWC_KYXC_NHWK
  • New DL-specific concepts: DlThreadConfigDescriptor, DlThreadClusterDescriptor, DlBlockTransferK0M0M1K1Descriptor, etc.

2. Large_Tensor Descriptor Reuse (42 parameters)

  • Large_Tensor has identical template parameters to regular XDL CShuffle
  • Only difference: internal SplitN=true flag in device operation (not exposed in factory interface)
  • Reuses existing descriptor: ConvAlgorithm_DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle
  • No new descriptor or concepts needed

3. Macro Collision Workaround

  • Both device_grouped_conv_fwd_multiple_abd_xdl_cshuffle.hpp (line 41) and device_grouped_conv_fwd_multiple_d_xdl_large_tensor_cshuffle.hpp (line 51) define GridwiseGemmTemplateParameters macro without #undef
  • Used #pragma push_macro/#pragma pop_macro to isolate Large_Tensor header's macro scope
  • This may need to fix on CK headers

vpietila-amd and others added 13 commits October 29, 2025 13:33
- Added 5 DL descriptor structs (39 configurable parameters)
- Added 10 C++20 concepts for type-safe validation
- Updated factory to read all parameters from descriptors
- Updated test helper to populate all descriptors
- All tests passing (13/13 including 3 new DL tests)
…huffle_Large_Tensor

- Add factory specialization for Large_Tensor device operation (conv_factory.hpp lines 1145-1265)
- Add macro collision workaround using pragma push/pop (conv_factory.hpp lines 43-51)
- Add test helper function run_test_DeviceGroupedConvFwdMultipleD_Xdl_CShuffle_Large_Tensor
- Add builder test file test_ckb_conv_fwd_2d_large_tensor_fp16.cpp with 2 test cases
- Update CMakeLists.txt to include new test file
- Reuse existing ConvAlgorithm_DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle descriptor
- Map all 42 template parameters identical to regular XDL CShuffle
- All 15 builder tests passing including 2 new Large_Tensor tests

Completes Task 350: All 4 forward convolution device operations now supported in CK Builder.
vpietila-amd
vpietila-amd previously approved these changes Nov 5, 2025
Copy link
Contributor

@vpietila-amd vpietila-amd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. There's a PR #3154 that will reduce the code duplication in the test code. If that PR goes in first, we need to do the same changes for this PR too. But they are relatively straightforward changes. Once this PR is merged in the develop branch, I'll do another refactoring to remove the explicit device op flag from the signature.

@@ -0,0 +1,69 @@
// SPDX-License-Identifier: MIT
// Copyright (c) 2025, Advanced Micro Devices, Inc. All rights reserved.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check that the Copyright follows the convention introduced in PR #3150. Same for the other new test file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@JH-Leon-KIM-AMD JH-Leon-KIM-AMD changed the title Jeonghyun/ckb add remining fwd conv device ops [CK_BUILDER]ckb add remining fwd conv device ops Nov 6, 2025
- Change copyright format to: Copyright (C) Advanced Micro Devices, Inc., or its affiliates.
- Reorder headers: Copyright first, then SPDX-License-Identifier
- Updated files:
  * experimental/builder/test/conv/test_ckb_conv_fwd_2d_dl_fp16.cpp
  * experimental/builder/test/conv/test_ckb_conv_fwd_2d_large_tensor_fp16.cpp
  * experimental/builder/include/ck_tile/builder/device_op_types.hpp
@shumway shumway merged commit 5f3cae3 into develop Nov 7, 2025
49 checks passed
@shumway shumway deleted the jeonghyun/ckb-add-remining-fwd-conv-device-ops branch November 7, 2025 00:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants