Fix misaligned addresses with batched loggers #1578
Conversation
Force-pushed 1cfe7e8 to afd35e5
Using 64 bit indices doesn't seem to make any sense for batched problems to me. This looks more like a quick patch than a proper fix. I'll try to prototype an example for how we could improve upon this, I think CUB provides a nice blueprint.
In this case the maximum number of elements possible is equal to the number of batch items, so I think it is definitely possible to have more than max(int32) for that. Additionally, this is not a shared memory issue, but an issue of using a workspace and storing pointers to different types within it, which was causing the misaligned accesses. Nevertheless, your suggestion of the CUB approach is a nice way to do it, and I will look into that.
This change only deals with final iteration counts, which is where IMO 64-bit integers make no sense.
Yes, only the final iteration counts are being logged, but the number of entries scales with the number of batch items being solved, which can possibly be more than max(int32).
The type you use to index an array is independent of the value type of the array; I am talking about the value type.
Ah, yes. I misunderstood your comment. I am looking into the CUB blueprint. That should be usable for correct alignment of both the workspace vectors and the shared memory objects.
Force-pushed ac14951 to 0afc369
I think the PR description and title need to change now.
I think both the title and description are still valid. I have updated the description now.
Can we maybe first summarize the issue? If I understand it correctly, the issue was the `res_norm` pointer in `batch_logger.hpp`. The alignment of that pointer depends on how many batch items there are and on the size in bytes of the `iter_counts`, which use the workspace memory before the `res_norm`. In some cases (`num_batch_items==1`) this could lead to an alignment for `res_norm` of less than 8 bytes.
The workspace manager from CUB fixes the alignment by ensuring that each memory region from the workspace is aligned to at least 8 bytes.
TBH, the CUB implementation seems a bit overkill for us. We just need a function like `T* next_aligned(void* ptr)` which gives you a pointer to the next aligned (to 8 bytes, I guess) memory location.
If the CUB implementation should stay, it needs a lot more documentation. I'm also not a fan of the chain `layout->slot->alias`; it could be `layout->alias` directly.
@@ -51,17 +51,22 @@ struct log_data final {
          array<unsigned char>& workspace)
        : res_norms(exec), iter_counts(exec)
    {
        const size_type workspace_size =
            num_batch_items * (sizeof(real_type) + sizeof(int));
        const size_type workspace_size = num_batch_items * 32;
Is that `* 32` because of the warp size, or where does this come from?
No, we just want to ensure a large enough workspace. In this case, we align everything to 8 bytes.
I will take a look at this PR on Wednesday, but I can already tell you that `thrust::complex<double>` needs an alignment of 16 bytes.
In this case, we are aligning to the largest real floating-point type (should be `remove_complex<T>`), which is 8 bytes. But for the general case, yes, I guess we would need to align to 16.
#if defined(__CUDACC__)
#define GKO_DEVICE_ERROR_TYPE cudaError_t
#define GKO_DEVICE_ERROR_INVALID cudaErrorInvalidValue
#define GKO_DEVICE_NO_ERROR cudaSuccess
#elif defined(__HIPCC__)
#define GKO_DEVICE_ERROR_TYPE hipError_t
#define GKO_DEVICE_ERROR_INVALID hipErrorInvalidValue
#define GKO_DEVICE_NO_ERROR hipSuccess
#else
Since only success/failure is needed, you don't need these wrappers.
The plan is to use these for aliasing the shared memory pointers in the device kernels as well. In those cases, we would need the return code, which would be useful in narrowing down the issue.
btw, does this have to be in a public header? Wouldn't a core header be sufficient?
I have the same question as MarcelKoch. The current usage only requires these aliases in a core header, not a public one.
Force-pushed bc51458 to 051fb6d
The check-format job will fail because it uses develop instead of the updated pre-commit file.
const size_t num_batch_items = 2;
const size_t num_batch_items = 1;
any reason for that?
Yes, with num_batch_items=1 and without this fix, current develop gives the error
Wait, I thought the error came from CUDA, not from reference. Did the error also happen in reference?
Yes, you are right. I somehow pushed it to the wrong file. It only fails in CUDA, not in reference.
// This code is a modified version of the code from CCCL
// (https://github.com/NVIDIA/cccl) (cub/detail/temporary_storage.cuh and
Maybe also mention what the modified parts are? I guess only the namespace and error codes.
iter_counts =
    array<int>::view(exec, num_batch_items,
                     reinterpret_cast<int*>(workspace.get_data()));
res_norms = array<real_type>::view(
    exec, num_batch_items,
    reinterpret_cast<real_type*>(workspace.get_data() +
                                 (sizeof(int) * num_batch_items)));
Originally, I thought it should be solved by using the view on the larger type first.
I haven't tested that; it is possible. But for cases where we need to have different types, it would not help. This approach is more general.
One `if-else` condition should solve all cases.
The plan would be to use this in all cases, for shared memory inside the kernels as well, so IMO the CUB solution is a better and more general approach.
      array<unsigned char>& workspace)
    : res_norms(exec), iter_counts(exec)
{
    const size_type reqd_workspace_size = num_batch_items * 32;
If I understand it correctly, the mapping is to ensure the memory address is divisible by the size of the data type. That is also why I think viewing the larger data type first would have been enough.
Assume the workspace data is not aligned at the beginning (which should not be the case if the workspace is its own allocation): then the actual memory needs to be a little larger than reqd_workspace_size, since the first unaligned part is cut off. If reqd_workspace_size is the same as the workspace allocation, it does not help when the workspace is not aligned at the beginning.
Okay, I may have misunderstood the meaning of `reqd_workspace_size`. I thought it was the required workspace size, which is why I thought it would fail when the workspace is not aligned (the actual size needs to be larger than the required size == workspace size). However, it is just the size of the workspace pointer in `create_workspace_aliases`. It only becomes the required workspace size (as reported by the function) when the aliasing phase fails.
By the way, do you have a test such that we face the issue in our pipeline?
This is always so annoying....
LGTM. The name reqd_workspace_size is still confusing to me, because it has different meanings as function input and output, but I do not have an idea for a better name.
Also, the workspace allocation in the solver for log_data should get its size from log_data (at least according to the comments).
      array<unsigned char>& workspace)
    : res_norms(exec), iter_counts(exec)
{
    const size_type reqd_workspace_size = num_batch_items * 32;
Suggested change:
const size_type reqd_workspace_size = num_batch_items * 32;
// It should be at least `num * (sizeof(real_type) + sizeof(int))` with some additional buffer for alias purposes, but we simply request a large enough size here.
const size_type reqd_workspace_size = num_batch_items * 32;
const size_type workspace_size = system_matrix->get_num_batch_items() *
                                 (sizeof(real_type) + sizeof(int));
const size_type workspace_size =
    system_matrix->get_num_batch_items() * 32;
Maybe replace it with log_data->get_workspace_size(num_batch)? It may help understanding in the future.
At least add some comments here to indicate what this workspace is for.
Force-pushed 85a70b8 to 1f41636
private:
    GKO_ATTRIBUTES void set_bytes_required(std::size_t new_size)
    {
        size_ = max(size_, new_size);
This causes an error in our no-circular-deps test, because nothing provides the `max` function. I think it should be fine to use `std::max`. It should even be available in device code, since it's `constexpr` since C++14. If that is somehow not the case, then removing the `GKO_ATTRIBUTES` should be fine, since this functionality is not used in device code anyway.
Co-authored-by: Marcel Koch <[email protected]>
Co-authored-by: Yu-Hsiang Tsai <[email protected]>
Force-pushed 1f41636 to 5f5967e
Error: The following files need to be formatted:
You can find a formatting patch under Artifacts here or run
This PR fixes CUDA misaligned-address errors in the batched loggers when mixing 32-bit and 64-bit types, as reported in #1576. It adds a setup that allows aliasing workspace pointers with arrays of different types, which is necessary to avoid repeated allocations while preventing misalignment issues.
Fixes #1576