
Makes UR cuda backend compatible with MPI #2077

Closed

Conversation


@JackAKirk commented Sep 10, 2024

Fixes intel/llvm#15251, provided that MPI/SYCL codes are updated as in codeplaysoftware/SYCL-samples#33.

I will follow up with a corresponding fix for HIP, which is pretty much a copy-paste; I could add it to this PR, but that would require #1830 to be merged first.

Background

The origin of the above issue is that, since we no longer have a single CUDA device per platform, all devices currently initialize their primary CUcontexts at platform instantiation, even when the runtime does not require them.
The MPI interface can work with CUDA/ROCm awareness because it is assumed that the user will set only a single HIP/CUDA device per process prior to each MPI call. The problem with having multiple CUDA devices set (i.e. their primary CUcontexts instantiated), as is the case when using e.g. the default sycl::context with multiple devices included, is that MPI calls will operate on every instantiated device whenever at least one active process has that device's primary CUcontext set.
This can lead to memory leaks, as described in intel/llvm#15251.

This PR fixes the issue by removing the platform-scope CUcontext instantiation and making sycl::context, and the UR context in the CUDA backend, responsible for instantiating and releasing the native CUcontexts associated with all devices included in that sycl::context. This works due to the universal usage of sycl::context and ur_context_handle_t_ in the SYCL/UR APIs. I think this is the most natural solution for SYCL compatibility; it adds only a negligible one-time overhead at platform instantiation, but has the important benefit of supporting application-critical technologies, MPI and *CCL, with SYCL.
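As an illustrative sketch of the idea (not the actual diff; it assumes cuInit has already been called during platform initialization, and the struct shown is a heavy simplification of the real ur_context_handle_t_):

```cpp
#include <cuda.h>
#include <utility>
#include <vector>

// Illustrative sketch only: the UR context retains the primary CUcontext of
// each of its member devices on construction and releases them on
// destruction, so only devices the user actually placed in a sycl::context
// ever become "active" in the process.
struct ur_context_handle_t_ {
  std::vector<CUdevice> Devices;

  explicit ur_context_handle_t_(std::vector<CUdevice> Devs)
      : Devices(std::move(Devs)) {
    for (CUdevice Dev : Devices) {
      CUcontext Ctx = nullptr;
      cuDevicePrimaryCtxRetain(&Ctx, Dev); // instantiate the primary context
    }
  }

  ~ur_context_handle_t_() {
    for (CUdevice Dev : Devices)
      cuDevicePrimaryCtxRelease(Dev); // released when the sycl::context dies
  }
};
```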

Without considering MPI compatibility, sycl::context has been a free parameter (so to speak) that has not been functionally used in the CUDA/HIP backends; it turns out to be fortunate that it exists, since otherwise there would not have been a straightforward way to incorporate MPI with SYCL, as far as I can see.

All of this is done without changing the UR/oneAPI specs.

Signed-off-by: JackAKirk <[email protected]>
@github-actions github-actions bot added the cuda CUDA adapter specific issues label Sep 10, 2024
JackAKirk added a commit to JackAKirk/llvm that referenced this pull request Sep 10, 2024
@JackAKirk

tested here:

intel/llvm#15349

@github-actions github-actions bot added the conformance Conformance test suite issues. label Sep 11, 2024

JackAKirk commented Sep 12, 2024

Here is some more information on why this design choice was made, and a comparison with alternatives.

Firstly, we start with the requirement that we fix intel/llvm#15251.
If we didn't, MPI wouldn't work with DPC++/SYCL, and I think it would then also be impossible to build a oneCCL (which is based on MPI interfaces) NVIDIA backend for DPC++/SYCL.

Then we identify that the fix is to not allow CUcontexts to remain initialized when they are associated with GPUs that are not being used by the programmer in the process that initialized them. This means we have to change our current way of working, which initializes CUcontexts at platform instantiation and only destroys them in the device destructors called at platform destruction.
At this point we encounter the only spec issue, regarding timestamps (from SYCL spec revision 9, 4.6.6.1 "Event information and profiling descriptors"):

Timestamp Requirement

"Each profiling descriptor returns a 64-bit timestamp that represents the number of nanoseconds that have elapsed since some implementation-defined timebase. All events that share the same backend are guaranteed to share the same timebase, therefore the difference between two timestamps from the same backend yields the number of nanoseconds that have elapsed between those events."

This is simply not implementable word for word (unless I am mistaken) in the CUDA backend under the constraint that we are MPI compliant (for brevity, I will assume this constraint holds in what follows), for the following reasons:

  • A CUevent needs an active CUcontext in order to be created.
  • A CUevent is invalid if the CUcontext that created it is destroyed.
  • urDeviceGetGlobalTimestamps is not implementable in the CUDA backend without a reference CUevent: see the interface of cuEventElapsedTime, which is the only way of obtaining an event timing difference and which requires a reference CUevent.

Therefore we cannot create a "backend global" CUevent timestamp at backend/platform initialization.
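
For illustration, a minimal sketch of why a timestamp in the CUDA driver API can only be expressed relative to a base CUevent living in an active CUcontext (the helper name here is hypothetical):

```cpp
#include <cuda.h>
#include <cstdint>

// Hypothetical helper: nanoseconds elapsed since BaseEvent. Both events must
// belong to the same, still-alive CUcontext, which is why a "backend global"
// timestamp cannot survive once contexts are destroyed with their
// sycl::context.
uint64_t elapsedSinceBase(CUevent BaseEvent, CUstream Stream) {
  CUevent Now = nullptr;
  cuEventCreate(&Now, CU_EVENT_DEFAULT); // requires an active CUcontext
  cuEventRecord(Now, Stream);
  cuEventSynchronize(Now);
  float Ms = 0.0f;
  cuEventElapsedTime(&Ms, BaseEvent, Now); // only ever returns a difference
  cuEventDestroy(Now);
  return static_cast<uint64_t>(Ms * 1e6); // milliseconds -> nanoseconds
}
```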

However, in practice I don't think this change can break any SYCL code, because profiling can only be performed once a sycl::queue has been created, and you cannot create a sycl::queue without a sycl::context. Therefore this change, which initializes the CUcontext on sycl::context construction and destroys it on sycl::context destruction, is in practice equivalent to the spec requirement. In fact, this change actually seems to fix the CUDA backend's urDeviceGetGlobalTimestamps UR conformance tests. I'm not entirely sure why, but I note that we previously had a separate CUevent timestamp reference per device, in contradiction to the SYCL spec.

At this point it is worth considering an alternative: imagine that we instead tie CUcontext instantiation to sycl::queue creation.
If we did this we would, in practice, break the Timestamp Requirement above. Say, for example, a user creates two sycl::queues and then performs profiling analysis using sycl::events returned from submissions to those two queues. Since the two queues would use different reference timestamps, the profiling information would be wrong. We would have to give the implementation some means of knowing whether a queue had already been created for the requested device and, if so, some means of passing the already-created CUevent to the new queue. This would certainly be possible, but it would be much more complex and would also add a small overhead to queue creation. Moreover, if the second queue were created after the first queue was destroyed (which admittedly would be unusual), the initially created event would be invalid, in which case it would be impossible for the two queues to share the same base timestamp.

In addition to sycl::queue instantiating the CUcontext, we would also have to make sycl::kernel_bundle instantiate a CUcontext, since it is not tied to a queue. If we did that, then at least for the DPC++ implementation I think all core SYCL functionality would be valid, since SYCL features require either a sycl::queue or a sycl::kernel_bundle to work.
Even in this "core SYCL 2020" world, a downside of the approach is that programs are possible (perhaps unusual ones; the kernel_bundle case is probably the more realistic) in which queues/kernel_bundles are continuously created and destroyed. If such a program hits the point where the last reference to a CUcontext is released, that CUcontext gets destroyed. There is a ~50 ms overhead to either instantiate or destroy a CUcontext, so this would not be desirable. In theory, for a program that creates a kernel_bundle, uses it, deletes it, then creates a new one ad infinitum, the overhead would be very substantial.

However, even ignoring the above issues, the oneAPI virtual memory extension does not take a queue argument, yet its implementation requires setting a CUcontext. This means we would have to add a failsafe CUcontext retention call, as we have already had to do for a single device query: https://github.com/oneapi-src/unified-runtime/pull/2077/files#diff-641b75ae8137280ac68523353cbb6eb8059f8581b35261d7a96d179a478229bcR810. We would then face a situation similar to the one described above, whereby users use the virtual memory extension for a given device while no queue for that device is in scope, incurring repeated CUcontext initialization/destruction costs.

Summary

So there are three main reasons to prefer instantiation at context scope rather than queue/kernel_bundle scope:

  1. It doesn't break profiling across multiple queues, and doesn't necessitate a more complex, slower implementation.
  2. It doesn't slow down unusual usages of sycl::queue/sycl::kernel_bundle.
  3. It doesn't slow down the virtual_memory, bindless images, and usm_p2p extensions, or require associating them with a sycl::queue via a spec change.

A final, more general reason is that the queue/kernel_bundle approach would require more refactoring of the UR implementation than the suggested fix, beyond what I have already outlined for reasons 1-3.

In practice this doesn't change how users interact with SYCL outside of MPI (or outside cases where they want to prevent devices being used for some other reason). Single-process users can continue to use the default sycl::context and ignore sycl::context entirely. If they want to use MPI with only specific devices visible, they just need to create the context manually as described in codeplaysoftware/SYCL-samples#33.
I suppose the only downside is that this means sycl::context can't be deprecated, but that seems like it definitely isn't happening anyway. We need to get application-critical things like MPI working fluidly ASAP.
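
For reference, a minimal sketch of that usage pattern (the device-to-rank mapping here is an assumption; the authoritative example is in codeplaysoftware/SYCL-samples#33):

```cpp
#include <sycl/sycl.hpp>
#include <mpi.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int Rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &Rank);

  // One GPU per rank (assumes at least one GPU is visible), so only that
  // device's primary CUcontext is ever instantiated in this process.
  auto GPUs = sycl::device::get_devices(sycl::info::device_type::gpu);
  sycl::device Dev = GPUs[Rank % GPUs.size()];

  // Explicit single-device context: with this PR only Dev's native CUcontext
  // is retained, keeping the process CUDA-aware-MPI friendly.
  sycl::context Ctx{Dev};
  sycl::queue Q{Ctx, Dev};

  // ... allocate USM against Q/Ctx and pass the pointers to MPI calls ...

  MPI_Finalize();
  return 0;
}
```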

Semantics wrt CUDA Runtime

Essentially we swap the functionality of cudaSetDevice() for the usage of a sycl::context(Devices) that is then passed to the queue/kernel_bundle/virtual memory extension etc. Unlike in the CUDA Runtime, in SYCL the number of devices can be greater than one, since this information is used in some backends to implement sycl::buffer and to allow easier OpenCL interoperability. For better or for worse, SYCL chose this implicit device-setting idiom. Note that if it were not for the MPI constraint, we could use sycl::platform for this device-setting purpose (as we currently do in the existing implementation), since always setting all available devices by default would never cause a problem. MPI requires that we "promote" sycl::context in this way for the CUDA/HIP backends. I imagine MPI places the same constraint on the Level Zero backend, due to the implicit device-setting nature of SYCL, though I guess it is already satisfied there. Update: I looked into the L0 adapter implementation; I think it is possible that L0 has similar issues, depending on IPC usage, and that if intel/llvm#15251 could be built for PVC you would see similar problems.

This also all fits in with the description of sycl::context in the sycl specification. Effectively we are giving sycl::context the job it has been waiting for.
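
For comparison, a simplified sketch of the CUDA Runtime idiom being mirrored here: each rank explicitly binds to one device before any CUDA-aware MPI call, which is the role the single-device sycl::context plays in the sketch further above.

```cpp
#include <cuda_runtime.h>
#include <mpi.h>

// Simplified sketch of the CUDA Runtime equivalent of the per-rank
// single-device sycl::context shown earlier.
void bindRankToDevice() {
  int Rank = 0, NumDevices = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &Rank);
  cudaGetDeviceCount(&NumDevices);
  cudaSetDevice(Rank % NumDevices); // only this device's context becomes active
}
```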

Take into account that CUevents are no longer valid
if the CUcontext that created them is destroyed.

- Fix format.

- Instantiate dummy vars with nullptr.

- Pass EvBase by ref.

Signed-off-by: JackAKirk <[email protected]>
@JackAKirk JackAKirk marked this pull request as ready for review September 13, 2024 13:19
@JackAKirk JackAKirk requested review from a team as code owners September 13, 2024 13:19

JackAKirk commented Sep 13, 2024

I think the device_num.cpp failure must be unrelated; I see it happening occasionally in other PRs. See:
#2089

for (auto &Dev : Devices) {
  urDeviceRetain(Dev);
  Dev->retainNativeContext();
@hdelan (Contributor) commented Sep 13, 2024

I would personally prefer to see all of this context logic happen in ur_queue_handle_t_. This avoids giving sycl::context extra semantics for the CUDA backend. Within urQueueCreate you could call something like ur_device_handle_t_::init_device(), which would retain the primary ctx and then set the base event; these would then be cached in the device, so if another queue is created for the same device it doesn't need to repeat the base-event setup, info querying, etc.
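
A rough sketch of that alternative (all names and details here are hypothetical, purely to illustrate the suggestion):

```cpp
#include <cuda.h>

// Hypothetical sketch of the queue-scoped alternative: the device lazily
// retains its primary context and records a base event the first time a
// queue is created for it, caching both so later queues skip the setup.
struct ur_device_handle_t_ {
  CUdevice CuDevice;
  CUcontext CuContext = nullptr; // primary context, retained lazily
  CUevent EvBase = nullptr;      // base timestamp event, created once

  void init_device() {
    if (CuContext != nullptr)
      return; // a previous urQueueCreate already did the work
    cuDevicePrimaryCtxRetain(&CuContext, CuDevice);
    cuCtxSetCurrent(CuContext);
    cuEventCreate(&EvBase, CU_EVENT_DEFAULT);
    cuEventRecord(EvBase, /*hStream=*/0); // default stream
  }
};
```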

Let's see what @npmiller thinks.

Contributor

Doing it in the queue would be nicer; it means we wouldn't need the user-code changes dealing with sycl::context, since the queue already works with just one device. Not 100% sure it would work well with the event timing, though.

@JackAKirk (Contributor, Author) Sep 23, 2024

OK, I'll close this PR. I've described how I think it would work with event timings, and any other edge-case limitations of using the queue/kernel_bundle/etc. to instantiate the CUcontext, here: #2077 (comment)
A third alternative is to create the CUcontext on demand, allowing the number of CUcontext references to grow without bound and relying on the final DPC++ runtime teardown to reset them; I think this has similar edge-case limitations, in addition to the open question of whether CUcontext references can grow without bound without consequences.

detail::ur::assertion(cuMemGetInfo(&FreeMemory, &TotalMemory) ==
                          CUDA_SUCCESS,
                      "failed cuMemGetInfo() API.");
UR_CHECK_ERROR(cuDevicePrimaryCtxRelease(hDevice->get()));
return ReturnValue(FreeMemory);
Contributor

I seem to remember there were multiple properties that needed a context; are you sure this is enough?

Contributor Author

Yeah I'm pretty sure:

sycl-ls tests all the supported aspects of the test device that are exposed in DPC++. This exercises, for example:

  • cuDeviceGetPCIBusId
  • cuDeviceGetUuid_v2 or cuDeviceGetUuid, depending on the toolkit version
  • cuDriverGetVersion
  • cuDeviceGetAttribute: this covers most cases
  • cuDeviceGetName

cuDeviceTotalMem is tested in another test without a context being created explicitly or implicitly: https://github.com/intel/llvm/blob/5cd9de100f8df3692b492e22d056e88798873ceb/sycl/test-e2e/syclcompat/device/device.cpp#L167

This leaves cuMemGetInfo as the only one in the sycl-ls test that fails without a CUcontext set. It is also the only API here whose result depends on a CUcontext being set: I think the API tells you how much memory is left for the CUcontext to use.
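
A simplified sketch of the retain/query/release pattern the diff hunk above is part of (function name and structure are illustrative; error handling omitted):

```cpp
#include <cuda.h>
#include <cstddef>

// Simplified sketch: query free memory for a device without leaving its
// primary context retained afterwards, so a pure device-info query does not
// make the device look "in use" to CUDA-aware MPI.
size_t queryFreeMemory(CUdevice Dev) {
  CUcontext Ctx = nullptr;
  cuDevicePrimaryCtxRetain(&Ctx, Dev); // temporarily activate the primary ctx
  cuCtxSetCurrent(Ctx);
  size_t FreeMemory = 0, TotalMemory = 0;
  cuMemGetInfo(&FreeMemory, &TotalMemory); // needs an active CUcontext
  cuDevicePrimaryCtxRelease(Dev);          // drop it again immediately
  return FreeMemory;
}
```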

@npmiller (Contributor)

We should be able to add an adapter test, to check that the context isn't active when doing just device queries.

Successfully merging this pull request may close these issues.

[CUDA][HIP] too many process spawned on multiple GPU systems