
Fix misaligned addresses with batched loggers #1578

Merged · 10 commits · May 16, 2024
Conversation

@pratikvn (Member) commented Mar 22, 2024

This PR fixes CUDA misaligned-address errors in the batched loggers when mixing 32-bit and 64-bit types, as reported in #1576. It adds a setup that allows aliasing workspace pointers to arrays of different types, which avoids repeated allocations while preventing misaligned accesses.

Fixes #1576

@pratikvn pratikvn added 1:ST:ready-for-review This PR is ready for review type:batched-functionality This is related to the batched functionality in Ginkgo is:bugfix This fixes a bug labels Mar 22, 2024
@pratikvn pratikvn self-assigned this Mar 22, 2024
@ginkgo-bot ginkgo-bot added mod:core This is related to the core module. mod:cuda This is related to the CUDA module. mod:reference This is related to the reference module. type:solver This is related to the solvers mod:hip This is related to the HIP module. mod:dpcpp This is related to the DPC++ module. labels Mar 22, 2024
@pratikvn pratikvn requested a review from a team March 22, 2024 13:17
@upsj (Member) left a comment:
Using 64 bit indices doesn't seem to make any sense for batched problems to me. This looks more like a quick patch than a proper fix. I'll try to prototype an example for how we could improve upon this, I think CUB provides a nice blueprint.

@pratikvn (Member, Author):

In this case the maximum number of elements possible is equal to the number of batch items, so I think it is definitely possible to exceed max(int32) there. Additionally, this is not a shared memory issue: the misaligned accesses were caused by storing pointers to different types within a single workspace.

Nevertheless, your suggestion on the CUB approach is a nice way to do it, and I will look into that.

@MarcelKoch MarcelKoch added this to the Ginkgo 1.8.0 milestone Apr 5, 2024
@upsj (Member) commented Apr 9, 2024:

This change only deals with final iteration counts, which is where IMO 64 bit integers make no sense.

@pratikvn (Member, Author) commented Apr 9, 2024:

Yes, only the final iteration counts are being logged, but the number of entries scales with the number of batch items being solved, which can possibly exceed max(int32).

@upsj (Member) commented Apr 9, 2024:

The type you use to index an array is independent of the value type of the array; I am talking about the value type.

@pratikvn (Member, Author) commented Apr 9, 2024:

Ah, yes. I misunderstood your comment. I am looking into the CUB blueprint. That should be usable for correct alignment of both the workspace vectors and the shared memory objects.

@pratikvn pratikvn force-pushed the fix-logger-misaligned branch 3 times, most recently from ac14951 to 0afc369 Compare April 23, 2024 21:11
@MarcelKoch (Member):

I think the PR description and title need to change now.

@pratikvn (Member, Author):

I think both the title and description are still valid. I have updated the description now.

@MarcelKoch (Member) left a comment:

Can we maybe first summarize the issue? If I understand it correctly, the issue was the res_norm pointer in batch_logger.hpp. The alignment of that pointer depends on how many batch items there are, and the size in bytes of the iter_counts, which are using the workspace memory before the res_norm. In some cases (num_batch_items==1) this could lead to an alignment for res_norm less than 8 bytes.

The workspace manager from CUB fixes the alignment by ensuring that each memory region from the work space is aligned to at least 8 bytes.

TBH, the CUB implementation seems a bit overkill for us. We just need a function like

T* next_aligned(void* ptr)

which gives you a pointer to the next aligned (to 8 bytes, I guess) memory location.

If the CUB implementation should stay, it needs a lot more documentation. I'm also not a fan of the layout -> slot -> alias chain; it could be layout -> alias directly.
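A minimal sketch of the helper described above (the name `next_aligned` and its signature come from the comment; the implementation below is an assumption, not Ginkgo's actual code):

```cpp
#include <cstdint>

// Hypothetical helper: round a raw pointer up to the next address
// that satisfies T's alignment requirement.
template <typename T>
T* next_aligned(void* ptr)
{
    const auto addr = reinterpret_cast<std::uintptr_t>(ptr);
    constexpr std::uintptr_t align = alignof(T);
    // round up to the next multiple of align
    const auto aligned = (addr + align - 1) / align * align;
    return reinterpret_cast<T*>(aligned);
}
```

With an 8-byte-aligned workspace base, a helper like this would land the res_norm array on a valid boundary regardless of how many int iteration counts precede it.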

include/ginkgo/core/base/workspace_aliases.hpp (review thread resolved)
@@ -51,17 +51,22 @@ struct log_data final {
array<unsigned char>& workspace)
: res_norms(exec), iter_counts(exec)
{
const size_type workspace_size =
num_batch_items * (sizeof(real_type) + sizeof(int));
const size_type workspace_size = num_batch_items * 32;
Member:

Is that * 32 because of the warp size, or where does this come from?

Member Author:

No, we just want to ensure a large enough workspace. In this case, we align everything to 8 bytes.

Member:

I will take a look at this PR on Wednesday, but I can already tell you that thrust::complex<double> needs an alignment of 16 bytes

Member Author:

In this case, we are aligning to the largest real floating-point type (i.e. remove_complex<T>), whose alignment is 8. But for the general case, yes, I guess we would need to align to 16.
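To illustrate the alignment requirements under discussion, a small stand-in (the `complex_like` struct below is an assumption mimicking the declared 16-byte alignment of thrust::complex<double>; figures assume a typical 64-bit target):

```cpp
#include <cstddef>

// Stand-in for an over-aligned value type such as thrust::complex<double>,
// which declares 16-byte alignment.
struct alignas(16) complex_like {
    double re;
    double im;
};

// Aligning the workspace to 8 bytes covers all real floating-point types,
// but over-aligned complex types need 16.
static_assert(alignof(double) == 8, "double requires 8-byte alignment");
static_assert(alignof(complex_like) == 16, "complex-like type requires 16");
```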

include/ginkgo/core/solver/batch_solver_base.hpp (review thread resolved)
include/ginkgo/core/base/workspace_aliases.hpp (review thread resolved)
include/ginkgo/core/log/batch_logger.hpp (review thread resolved)
Comment on lines +39 to +47
#if defined(__CUDACC__)
#define GKO_DEVICE_ERROR_TYPE cudaError_t
#define GKO_DEVICE_ERROR_INVALID cudaErrorInvalidValue
#define GKO_DEVICE_NO_ERROR cudaSuccess
#elif defined(__HIPCC__)
#define GKO_DEVICE_ERROR_TYPE hipError_t
#define GKO_DEVICE_ERROR_INVALID hipErrorInvalidValue
#define GKO_DEVICE_NO_ERROR hipSuccess
#else
Member:

Since only success/failure is needed, you don't need these wrappers.

Member Author:

The plan is to use these for aliasing the shared memory pointers in the device kernels as well. In those cases, we would need the return code, which would be useful in narrowing down the issue.

Member:

btw, does this have to be in a public header? Wouldn't a core header be sufficient?

Member:

I have the same question as MarcelKoch: the current usage only requires these aliases in a core header, not a public one.

include/ginkgo/core/log/batch_logger.hpp (review thread resolved)
core/log/batch_logger.cpp (review thread resolved)
@MarcelKoch MarcelKoch requested a review from upsj May 3, 2024 09:28
@pratikvn pratikvn requested a review from MarcelKoch May 10, 2024 12:50
.pre-commit-config.yaml (review threads resolved)
core/log/batch_logger.cpp (review thread resolved)
@pratikvn (Member, Author):

The check-format job will fail because it uses develop instead of the updated pre-commit file.

@pratikvn pratikvn requested a review from MarcelKoch May 11, 2024 10:39
-const size_t num_batch_items = 2;
+const size_t num_batch_items = 1;
Member:

any reason for that?

Member Author:

Yes, with num_batch_items=1 and without this fix, current develop gives the error.

Member:

Wait, I thought the error came from CUDA, not from reference. Did the error also happen in reference?

Member Author:

Yes, you are right. I somehow pushed it to the wrong file. It only fails in CUDA, not in reference.


Comment on lines +30 to +36
// This code is a modified version of the code from CCCL
// (https://github.com/NVIDIA/cccl) (cub/detail/temporary_storage.cuh and
Member:

Maybe also mention what the modified parts are? I guess only the namespace and the error codes.

Comment on lines -58 to -64
-        iter_counts =
-            array<int>::view(exec, num_batch_items,
-                             reinterpret_cast<int*>(workspace.get_data()));
-        res_norms = array<real_type>::view(
-            exec, num_batch_items,
-            reinterpret_cast<real_type*>(workspace.get_data() +
-                                         (sizeof(int) * num_batch_items)));
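The failure mode of this removed layout can be reduced to a few lines of arithmetic (illustrative only, not Ginkgo code; function name is hypothetical): res_norms starts immediately after the int iteration counts, so its byte offset is a multiple of 8 only for even batch counts.

```cpp
#include <cstddef>

// Offset of the res_norms array in the removed layout. With an
// 8-byte-aligned workspace base, base + offset is only a valid double*
// when the offset is a multiple of alignof(double).
bool res_norms_misaligned(std::size_t num_batch_items)
{
    const std::size_t offset = sizeof(int) * num_batch_items;
    return offset % alignof(double) != 0;
}
```

With num_batch_items == 1 the offset is 4 bytes, reproducing the misaligned-address error; with 2 items the offset is 8 and the bug stays hidden.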
Member:

Originally, I thought it could be solved by creating the view on the larger type first.

Member Author:

I haven't tested that; it might be possible, but it would not help in cases where we need several different types. This approach is more general.

Member:

one if-else condition should solve all cases

Member Author:

The plan would be to use this in all cases, for shared memory inside the kernels as well, so IMO the CUB solution is a better and more general approach.

array<unsigned char>& workspace)
: res_norms(exec), iter_counts(exec)
{
const size_type reqd_workspace_size = num_batch_items * 32;
Member:

If I understand it correctly, the mapping ensures that each memory address is divisible by the size of its data type. That is also why I thought it would be enough to view the larger data type first.
Assume the workspace data is not aligned from the beginning (which should not happen if the workspace is its own allocation): then the actual memory needed is a little more than reqd_workspace_size, because the first unaligned part is cut off. If reqd_workspace_size is the same as the workspace allocation, it does not help when the workspace is not aligned from the beginning.

Member:

Okay, I may have misunderstood the meaning of reqd_workspace_size. I thought it was the required workspace size, which is why I expected a failure when the memory is not aligned. The actual size is larger than the required size, which equals the workspace size.
However, it is just the size of the workspace pointer passed to create_workspace_aliases.
It only becomes the required workspace size (as computed by the function) when the aliasing phase fails.
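A rough bound makes the sizing concrete (assumptions: one int count and one double norm per batch item, with a single padding gap up to an 8-byte boundary between the two arrays; the helper name is hypothetical):

```cpp
#include <cstddef>

// Worst-case bytes needed for the two logger arrays: the int iteration
// counts, padding up to an 8-byte boundary, then the double residual norms.
std::size_t required_workspace_bytes(std::size_t num_batch_items)
{
    const std::size_t iter_bytes = num_batch_items * sizeof(int);
    const std::size_t padded = (iter_bytes + 7) / 8 * 8;  // align res_norms
    return padded + num_batch_items * sizeof(double);
}
```

That is at most 12 bytes per item plus a one-off pad, so the 32 bytes per item that reqd_workspace_size requests is a comfortable over-allocation.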

@yhmtsai (Member) commented May 11, 2024:

By the way, do you have a test so that we catch this issue in our pipeline?

@MarcelKoch (Member):

> the check-format will fail because it uses develop instead of the updated precommit file

This is always so annoying....

@yhmtsai (Member) left a comment:

LGTM. The name reqd_workspace_size is still confusing to me, because it has a different meaning as function input vs. output, but I don't have an idea for a better name.
Also, the workspace allocation in the solver for log_data should get its size from log_data (at least per the comments).

core/base/workspace_aliases.hpp (review thread resolved)
array<unsigned char>& workspace)
: res_norms(exec), iter_counts(exec)
{
const size_type reqd_workspace_size = num_batch_items * 32;
Member:

Suggested change:
-    const size_type reqd_workspace_size = num_batch_items * 32;
+    // it should be at least `num * (sizeof(real_type) + sizeof(int))` with some additional buffer for alias purposes, but we simply request a large enough size here.
+    const size_type reqd_workspace_size = num_batch_items * 32;

-    const size_type workspace_size = system_matrix->get_num_batch_items() *
-                                     (sizeof(real_type) + sizeof(int));
+    const size_type workspace_size =
+        system_matrix->get_num_batch_items() * 32;
Member:

Maybe replace it with log_data->get_workspace_size(num_batch)? That may help understanding in the future.
At least add some comments here to indicate what this workspace is for.

@pratikvn pratikvn added 1:ST:run-full-test 1:ST:ready-to-merge This PR is ready to merge. and removed 1:ST:ready-for-review This PR is ready for review labels May 12, 2024
private:
GKO_ATTRIBUTES void set_bytes_required(std::size_t new_size)
{
size_ = max(size_, new_size);
Member:

This causes an error in our no-circular-deps test, because nothing provides the max function. I think it should be fine to use std::max; it should even be available in device code, since it has been constexpr since C++14. If that is somehow not the case, then removing the GKO_ATTRIBUTES should be fine, since this functionality is not used in device code anyway.
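A sketch of the suggested fix (`grown_size` is a hypothetical stand-in for the set_bytes_required logic, which keeps the largest size requested so far):

```cpp
#include <algorithm>
#include <cstddef>

// std::max has been constexpr since C++14, so it can replace the
// unqualified max call without any device-specific helper.
constexpr std::size_t grown_size(std::size_t current, std::size_t requested)
{
    return std::max(current, requested);
}

static_assert(grown_size(16, 24) == 24, "slot grows to the larger request");
static_assert(grown_size(32, 8) == 32, "slot never shrinks");
```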

@ginkgo-bot:

Error: The following files need to be formatted:

core/base/workspace_aliases.hpp

You can find a formatting patch under Artifacts here or run format! if you have write access to Ginkgo

@pratikvn pratikvn merged commit e8af940 into develop May 16, 2024
11 of 15 checks passed
@pratikvn pratikvn deleted the fix-logger-misaligned branch May 16, 2024 19:35
Successfully merging this pull request may close these issues.

CUDA error with batched solvers
5 participants