
Improve memory ordering of sync-free kernels #1344

Merged: 7 commits merged into develop from tune_cholesky on Oct 10, 2023

Conversation

@upsj (Member) commented Jun 6, 2023

This adds load_relaxed(_shared), load_acquire(_shared), store_relaxed(_shared), and store_release(_shared) functions to provide limited atomic load/store support on NVIDIA GPUs.
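The function names above are the ones this PR introduces; the PTX below is a hedged sketch of how such primitives are commonly written for sm_70 (Volta) and newer, not code copied from the diff. Pre-Volta architectures would need a fallback (e.g. volatile accesses plus membar instructions), which is not shown.

```cpp
#include <cstdint>

// Sketch only: acquire load and release store via inline PTX with GPU scope.
// The .gpu scope and .acquire/.release qualifiers require sm_70 or newer.
__device__ __forceinline__ std::int32_t load_acquire(const std::int32_t* ptr)
{
    std::int32_t result;
    asm volatile("ld.acquire.gpu.s32 %0, [%1];"
                 : "=r"(result)
                 : "l"(ptr)
                 : "memory");
    return result;
}

__device__ __forceinline__ void store_release(std::int32_t* ptr,
                                              std::int32_t value)
{
    asm volatile("st.release.gpu.s32 [%0], %1;" ::"l"(ptr), "r"(value)
                 : "memory");
}
```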

TODO

  • Figure out deadlock on post-Volta

@upsj self-assigned this Jun 6, 2023
@upsj added the 1:ST:WIP (This PR is a work in progress. Not ready for review.) label Jun 6, 2023
@ginkgo-bot added the reg:testing, mod:cuda, mod:reference, type:solver, mod:hip, type:factorization, and reg:helper-scripts labels Jun 6, 2023
@upsj mentioned this pull request Jul 10, 2023
@upsj changed the title from "Improve Cholesky performance" to "Improve memory ordering of sync-free kernels" Jul 10, 2023
@upsj changed the base branch from develop to column_cholesky July 10, 2023 14:54
@upsj added the 1:ST:ready-for-review (This PR is ready for review) label Jul 12, 2023
sonarcloud bot commented Jul 13, 2023

SonarCloud Quality Gate passed!

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
0 Code Smells (rating A)
0.0% Coverage
0.0% Duplication

@upsj force-pushed the column_cholesky branch 2 times, most recently from 2a2b5d3 to db38887, July 25, 2023 15:29
Base automatically changed from column_cholesky to develop July 26, 2023 07:00
@upsj requested a review from thoasm September 18, 2023 12:47
@thoasm (Member) left a comment

I would like to have some small documentation and explanation inside the memory.cuh file to clarify things.
Also, I feel like the names of the functions aren't accurate as they do different things depending on the architecture.
But I am a fan of these new functions!

cuda/components/memory.cuh (review thread resolved)
cuda/components/memory.cuh (review thread resolved)
cuda/components/memory.cuh (review thread resolved)
cuda/components/memory.cuh (review thread resolved, outdated)
@upsj removed the 1:ST:WIP (This PR is a work in progress. Not ready for review.) label Sep 22, 2023
@thoasm (Member) left a comment

Thanks for addressing all my comments.

cuda/components/memory.cuh (review thread resolved, outdated)
@upsj requested a review from thoasm September 25, 2023 12:52
@thoasm (Member) left a comment

LGTM!

```cpp
        }
    }
    __threadfence();
    group::tiled_partition<subwarp_size>(group::this_thread_block()).sync();
```
Member:

Does it need the warp sync?

@upsj (Member, Author) commented Oct 9, 2023:

Yes. Since only a single lane is waiting for the data here, we need to make sure the other threads wait here as well. It might be necessary to keep a threadfence here too, though IIRC syncwarp tends to do that implicitly; at the very least, all threads in the warp use the same cache, so any cache flush or similar done by one thread should affect all threads.
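To illustrate the pattern under discussion, here is a minimal sketch (hypothetical names, plain CUDA intrinsics instead of Ginkgo's group wrappers):

```cpp
// Lane 0 spins until a producer publishes the flag; the warp sync then
// forces every other lane to wait on lane 0, so all lanes inherit its
// dependency before reading the published data.
__device__ void wait_for_dependency(volatile int* ready_flag)
{
    if ((threadIdx.x % 32) == 0) {
        while (*ready_flag == 0) {
            // busy-wait for the producer
        }
    }
    __threadfence();  // order subsequent loads after the flag read
    __syncwarp();     // make the remaining lanes wait here as well
}
```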

Member:
The other threads also have the same dependency here, right?

@upsj (Member, Author):
They inherit the dependency from lane 0 because they wait for it.

```cpp
    return __nvvm_get_smem_pointer(ptr);
#else
    uint32 smem_ptr;
    asm("{{ .reg .u64 smem_ptr; cvta.to.shared.u64 smem_ptr, %1; cvt.u32.u64 "
```
Member:
Maybe it is a stupid question: according to https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html#incorrect-optimization, volatile ensures the asm statement is not deleted or moved. I think the location of this PTX does not affect anything, but what if it is deleted? Or can it only be deleted through optimization, or when its output is unused?

@upsj (Member, Author):
It can only be deleted if the optimizer manages to remove the dependency on the value the assembly computes. Since the following load/store is volatile, it cannot be optimized away.
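For reference, a sketch of the conversion being discussed (hypothetical wrapper name; the asm body follows the snippet quoted above):

```cpp
#include <cstdint>

// cvta.to.shared.u64 converts a generic 64-bit pointer into a shared-memory
// address; cvt.u32.u64 truncates it to the 32-bit form that PTX shared-state-
// space instructions expect. The asm itself is not volatile: because the
// caller immediately uses the result in a volatile load/store, the optimizer
// cannot remove the dependency and thus cannot delete this statement.
__device__ __forceinline__ std::uint32_t to_smem_ptr(const void* ptr)
{
    std::uint32_t smem_ptr;
    asm("{ .reg .u64 smem_ptr; cvta.to.shared.u64 smem_ptr, %1; "
        "cvt.u32.u64 %0, smem_ptr; }"
        : "=r"(smem_ptr)
        : "l"(ptr));
    return smem_ptr;
}
```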

cuda/components/memory.cuh (review thread resolved)
cuda/components/memory.cuh (review thread resolved, outdated)
```cpp
#include "common/cuda_hip/components/memory.hpp.inc"


__device__ __forceinline__ int32 load_relaxed_shared(const int32* ptr)
```
Member:
The following code is repeated for every type / shared combination with the corresponding PTX (except for complex). Maybe use macros like LOAD_ACQUIRE(TYPE, PTX_TYPE) that produce both load_* and load_*_shared; that may require moving the CUDA_ARCH macro out of this kind of macro. A sketch follows below.

Member:
Ah, you did that in Python. Isn't a macro enough for that?

@upsj (Member, Author):
I think a macro would be worse in terms of readability, since we are building strings out of many different components. Python allows us to at least give everything names in the template and parameter set (i.e. both where we define the macro and where we call it).

Member:
A macro can still give names to the parameters, although you need to spell out every instantiation manually instead of generating them from a for loop. The Python approach was an issue for me because of the weak connection between the generated code and its source: I first reviewed this file and tried to figure out whether a combination was missing, and only then discovered that there is a separate Python file for it. There is no strong connection between the Python script and the generated code, especially since the generated file is the final code, not an intermediate state. Could you at least add a comment noting that the code is generated by the Python file?

@upsj (Member, Author):
What I mean is that in Python you have named arguments, space(ptx_space_suffix=".shared", ...), while with preprocessor macros you only have the argument order, SPACE(.shared, ...), which is harder to maintain and read. Compare the toy sketch below.
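A toy illustration of this named-argument style (illustrative only, not the actual generate_cuda_memory_ptx.py script):

```python
# Every parameter is passed by name, both where the template is defined
# and where it is instantiated, so each combination stays self-describing.
template = (
    "__device__ __forceinline__ {value_type} load_relaxed{fn_suffix}"
    "(const {value_type}* ptr);"
)


def space(fn_suffix, value_types):
    for value_type in value_types:
        print(template.format(value_type=value_type, fn_suffix=fn_suffix))


space(fn_suffix="", value_types=["int32", "float", "double"])
space(fn_suffix="_shared", value_types=["int32", "float", "double"])
```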

cuda/components/memory.cuh (review thread resolved, outdated)
dev_tools/scripts/generate_cuda_memory_ptx.py (review thread resolved, outdated)
@upsj requested a review from yhmtsai October 9, 2023 09:34

upsj and others added 6 commits October 10, 2023 10:01
- const-correctness
- add doc to generic-to-shared ptr conversion
- improve generation script readability

Co-authored-by: Marcel Koch <[email protected]>
Co-authored-by: Thomas Grützmacher <[email protected]>
- update asm type annotations
- fix incorrect store

Co-authored-by: Yuhsiang M. Tsai <[email protected]>
@upsj added the 1:ST:no-changelog-entry (Skip the wiki check for changelog update) and 1:ST:ready-to-merge (This PR is ready to merge.) labels and removed the 1:ST:ready-for-review (This PR is ready for review) label Oct 10, 2023
@upsj (Member, Author) commented Oct 10, 2023:

I'll go ahead and merge this already, since only the DPC++ and OpenMP pipelines are outstanding, and those files were unmodified. Then we can move the other PRs forward soon.

@upsj merged commit 6f65404 into develop Oct 10, 2023 (12 of 14 checks passed)
@upsj deleted the tune_cholesky branch October 10, 2023 11:59
sonarcloud bot commented Oct 11, 2023

SonarCloud Quality Gate passed!

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
0 Code Smells (rating A)
No coverage information
No duplication information

Warning: The version of Java (11.0.3) you have used to run this analysis is deprecated and we will stop accepting it soon. Please update to at least Java 17.

upsj added a commit that referenced this pull request Nov 30, 2023:

As a follow-up to #1344, this replaces `volatile` operations with proper memory ordering in HIP.

Related PR: #1472
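A minimal sketch of the difference, assuming the clang atomic builtins available in HIP device code (hypothetical function names; not necessarily how the follow-up PR spells it):

```cpp
// Before: a volatile load, which forces the access to happen but carries
// no ordering semantics in the compiler's memory model.
__device__ int load_via_volatile(const int* ptr)
{
    return *reinterpret_cast<const volatile int*>(ptr);
}

// After: a relaxed atomic load with well-defined memory ordering.
__device__ int load_relaxed(const int* ptr)
{
    return __atomic_load_n(ptr, __ATOMIC_RELAXED);
}
```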
Labels:
1:ST:no-changelog-entry (Skip the wiki check for changelog update)
1:ST:ready-to-merge (This PR is ready to merge.)
mod:cuda (This is related to the CUDA module.)
mod:hip (This is related to the HIP module.)
mod:reference (This is related to the reference module.)
reg:helper-scripts (This issue/PR is related to the helper scripts mainly concerned with development of Ginkgo.)
reg:testing (This is related to testing.)
type:factorization (This is related to the Factorizations)
type:solver (This is related to the solvers)
5 participants