Fix misaligned addresses with batched loggers #1578
Conversation
Force-pushed 1cfe7e8 to afd35e5
Using 64 bit indices doesn't seem to make any sense for batched problems to me. This looks more like a quick patch than a proper fix. I'll try to prototype an example for how we could improve upon this, I think CUB provides a nice blueprint.
In this case the maximum number of elements possible is equal to the number of batch items, so I think it is definitely possible to have more than max(int32) for that. Additionally, this is not a shared memory issue, but an issue of using a workspace and storing pointers to different types within it, which was causing the misaligned accesses. Nevertheless, your suggestion of the CUB approach is a nice way to do it, and I will look into that.
This change only deals with final iteration counts, which is where IMO 64-bit integers make no sense.
Yes, only the final iteration counts are being logged, but the number of entries scales with the number of batch items being solved, which can possibly be more than max(int32).
The type you use to index an array is independent of the value type of the array; I am talking about the value type.
Ah, yes. I misunderstood your comment. I am looking into the CUB blueprint. That should be usable for correct alignment of both the workspace vectors and the shared memory objects.
Force-pushed ac14951 to 0afc369
I think the PR description and title need to change now.
I think both the title and description are still valid. I have updated the description now.
Can we maybe first summarize the issue? If I understand it correctly, the issue was the `res_norm` pointer in `batch_logger.hpp`. The alignment of that pointer depends on how many batch items there are and on the size in bytes of the `iter_counts`, which use the workspace memory before the `res_norm`. In some cases (`num_batch_items==1`) this could lead to an alignment for `res_norm` of less than 8 bytes.
The workspace manager from CUB fixes the alignment by ensuring that each memory region from the workspace is aligned to at least 8 bytes.
TBH, the CUB implementation seems a bit overkill for us. We just need a function like `T* next_aligned(void* ptr)` which gives you a pointer to the next aligned (to 8 bytes, I guess) memory location.
If the CUB implementation should stay, it needs a lot more documentation. I'm also not a fan of the chain `layout->slot->alias`; it could be `layout->alias` directly.
@@ -51,17 +51,22 @@ struct log_data final {
          array<unsigned char>& workspace)
        : res_norms(exec), iter_counts(exec)
    {
        const size_type workspace_size =
            num_batch_items * (sizeof(real_type) + sizeof(int));
        const size_type workspace_size = num_batch_items * 32;
Is that `* 32` because of the warp size, or where does this come from?
No, we just want to ensure a large enough workspace. In this case, we align everything to 8 bytes.
I will take a look at this PR on Wednesday, but I can already tell you that `thrust::complex<double>` needs an alignment of 16 bytes.
In this case, we are aligning to the largest real floating-point type (should be `remove_complex<T>`), which is 8 bytes. But for the general case, yes, I guess we would need to align to 16.
#if defined(__CUDACC__)
#define GKO_DEVICE_ERROR_TYPE cudaError_t
#define GKO_DEVICE_ERROR_INVALID cudaErrorInvalidValue
#define GKO_DEVICE_NO_ERROR cudaSuccess
#elif defined(__HIPCC__)
#define GKO_DEVICE_ERROR_TYPE hipError_t
#define GKO_DEVICE_ERROR_INVALID hipErrorInvalidValue
#define GKO_DEVICE_NO_ERROR hipSuccess
#else
Since only success/failure is needed, you don't need these wrappers.
The plan is to use these for aliasing the shared memory pointers in the device kernels as well. In those cases, we would need the return code, which would be useful in narrowing down the issue.
btw, does this have to be in a public header? Wouldn't a core header be sufficient?
I have the same question as MarcelKoch. The current usage only requires these aliases in a core header, not a public one.
Force-pushed bc51458 to 051fb6d
The check-format job will fail because it uses develop instead of the updated pre-commit file.
const size_t num_batch_items = 2;
const size_t num_batch_items = 1;
any reason for that?
Yes, with num_batch_items=1 and without this fix, current develop gives the error
Wait, I thought the error came from CUDA, not from reference. Did the error also happen in reference?
Yes, you are right. I somehow pushed it to the wrong file. It only fails in CUDA, not in reference.
// This code is a modified version of the code from CCCL
// (https://github.com/NVIDIA/cccl) (cub/detail/temporary_storage.cuh and
Maybe also mention what the modified parts are? I guess only the namespace and error codes.
iter_counts =
    array<int>::view(exec, num_batch_items,
                     reinterpret_cast<int*>(workspace.get_data()));
res_norms = array<real_type>::view(
    exec, num_batch_items,
    reinterpret_cast<real_type*>(workspace.get_data() +
                                 (sizeof(int) * num_batch_items)));
Originally, I thought it should be solved by using the view on the larger type first.
I haven't tested that; it is possible. But for cases where we need to have different types, it would not help. This approach is more general.
One `if-else` condition should solve all cases.
The plan would be to use this in all cases, for shared memory inside the kernels as well, so IMO the CUB solution is a better and more general approach.
      array<unsigned char>& workspace)
    : res_norms(exec), iter_counts(exec)
{
    const size_type reqd_workspace_size = num_batch_items * 32;
If I understand it correctly, the mapping is to ensure the memory address is divisible by the size of the data type. That is also why I think viewing the larger data type first would have been enough.
Assume the workspace data is not aligned at the beginning (which should not be the case if the workspace is its own allocation): then the actual memory needs to be a little larger than reqd_workspace_size, since the first unaligned part is cut off. If reqd_workspace_size is the same as the workspace allocation, it does not help when the workspace is not aligned at the beginning.
Okay, I may have misunderstood the meaning of `reqd_workspace_size`. I thought it was the required workspace size, which is why I thought it would fail when the workspace is not aligned (the actual size needs to be larger than the required size == workspace size). However, it is just the size of the workspace pointer in `create_workspace_aliases`. It only becomes the required workspace size (as reported by the function) when the aliasing phase fails.
By the way, do you have a test such that we face the issue in our pipeline?
This is always so annoying....
LGTM. The name reqd_workspace_size is still confusing to me, because it has different meanings as function input and output, but I do not have an idea for a better name.
Also, the workspace allocation in the solver for log_data should get its size from log_data (at least according to the comments).
      array<unsigned char>& workspace)
    : res_norms(exec), iter_counts(exec)
{
    const size_type reqd_workspace_size = num_batch_items * 32;
Suggested change:
const size_type reqd_workspace_size = num_batch_items * 32;
// It should be at least `num * (sizeof(real_type) + sizeof(int))` with some additional buffer for alias purposes, but we simply request a large enough size here.
const size_type reqd_workspace_size = num_batch_items * 32;
const size_type workspace_size = system_matrix->get_num_batch_items() *
                                 (sizeof(real_type) + sizeof(int));
const size_type workspace_size =
    system_matrix->get_num_batch_items() * 32;
Maybe replace it with log_data->get_workspace_size(num_batch)? It may help understanding in the future.
At least add some comments here to indicate what this workspace is for.
Force-pushed 85a70b8 to 1f41636
private:
    GKO_ATTRIBUTES void set_bytes_required(std::size_t new_size)
    {
        size_ = max(size_, new_size);
This causes an error in our no-circular-deps test, because nothing provides the `max` function. I think it should be fine to use `std::max`. It should even be available in device code, since it's `constexpr` since C++14. If that is somehow not the case, then removing the `GKO_ATTRIBUTES` should be fine, since this functionality is not used in device code anyway.
Co-authored-by: Marcel Koch <[email protected]>
Co-authored-by: Yu-Hsiang Tsai <[email protected]>
Force-pushed 1f41636 to 5f5967e
Error: The following files need to be formatted:
You can find a formatting patch under Artifacts here or run
This PR fixes CUDA misaligned-address errors in the batched loggers when mixing 32-bit and 64-bit types, as reported in #1576. It adds a setup that allows aliasing workspace pointers with arrays of different types, which is necessary to avoid repeated allocations while preventing misalignment issues.
Fixes #1576