CUDA stream support

Goals

We want to provide some APIs that support applications using CUDA streams with collections. The main CUDA API we are targeting uses CUDA events to express dependencies within or across streams. Slide 12 of this presentation provides an overview of the CUDA event API.

The CUDA event API allows asynchronous polling of events, which simplifies the implementation. The basic code looks like this:

    // Launch one sort per color on its own stream, recording an event after each.
    std::size_t j = 0;
    std::vector<cudaEvent_t> events;
    for (auto& pc : color_pc) {
      pc.sort(space[j]);                    // enqueue the sort on stream j
      cudaEvent_t event;
      cudaEventCreate(&event);
      cudaEventRecord(event, streams[j]);   // marks completion of the sort
      events.push_back(event);
      ++j;
    }
    // Poll until every recorded event has completed; cudaEventQuery returns
    // cudaErrorNotReady while work preceding the event is still in flight.
    for (auto& e : events) {
      while (cudaEventQuery(e) != cudaSuccess) ;
    }

The Kokkos DeviceSpace has a stream attached at construction, which can be retrieved later using the cuda_stream() method.
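
For example (a minimal sketch, assuming DeviceSpace is an alias for Kokkos::Cuda):

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    Kokkos::Cuda space{stream};                   // stream attached at construction
    cudaStream_t attached = space.cuda_stream();  // retrieve the attached stream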

Beyond this, CUDA provides cudaStreamWaitEvent(stream, event), which tells CUDA to halt progress on stream until the event completes. Users can enqueue future operations on the stream, but CUDA automatically holds them back until the event completes. This allows the creation of cross-dependencies between different streams. We do not plan to support this API in the initial implementation, but it might be needed later for more complex dependencies.
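
As an illustration, a minimal sketch of a cross-stream dependency (the kernels and the chain helper are hypothetical):

    __global__ void produce(float* data) { /* writes data */ }
    __global__ void consume(float* data) { /* reads data */ }

    void chain(cudaStream_t stream_a, cudaStream_t stream_b, float* data) {
      cudaEvent_t event;
      cudaEventCreate(&event);
      produce<<<1, 256, 0, stream_a>>>(data);   // enqueue work on stream_a
      cudaEventRecord(event, stream_a);         // mark the point to wait on
      cudaStreamWaitEvent(stream_b, event, 0);  // stream_b holds until the event fires
      consume<<<1, 256, 0, stream_b>>>(data);   // safe to enqueue immediately
      cudaEventDestroy(event);                  // deferred by CUDA until the event completes
    }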

Microbenchmarking Sort

I modified the particle sort benchmarks, written by Matt, to use CUDA events instead of explicit stream synchronization. With 8 colors, we are getting nearly a 3x speedup over the baseline of a single chunk of work. With 16 colors, we get about a 2x speedup. The conclusion is that streams can provide substantial speedups by enabling concurrent kernel dispatch across colors.

Kokkos with streams

To use Kokkos with streams, one has to provide a special wrapper around the RangePolicy, which is then passed as the second argument (after a label) to parallel_for, etc.:

    auto policy = Kokkos::Experimental::require(
      Kokkos::RangePolicy<>(space, 0, num_elms()),
      Kokkos::Experimental::WorkItemProperty::HintLightWeight
    );
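
The wrapped policy is then dispatched with a label; a sketch (the label and lambda body are placeholders):

    Kokkos::parallel_for("sort", policy, KOKKOS_LAMBDA(int i) {
      // per-element work runs on the stream attached to `space`
    });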

The space here is a DeviceSpace(stream) initialized with the CUDA stream. I've done some experiments with the WorkItemProperty hint, which has three options (no hint, heavy weight, light weight). These options control which memories are used for executing the kernel. With some micro-benchmarking of the sort, I found that the light and no-hint options are nearly identical in performance, while the heavy hint slows down the kernel:

    timing-heavyhint1.log:|   sort2 color: 0.198738 [10]
    timing-heavyhint2.log:|   sort2 color: 0.199663 [10]
    timing-heavyhint3.log:|   sort2 color: 0.200205 [10]
    timing-heavyhint4.log:|   sort2 color: 0.201135 [10]
    timing-heavyhint5.log:|   sort2 color: 0.201383 [10]
    timing-lighthint1.log:|   sort2 color: 0.14192 [10]
    timing-lighthint2.log:|   sort2 color: 0.142516 [10]
    timing-lighthint3.log:|   sort2 color: 0.143825 [10]
    timing-lighthint4.log:|   sort2 color: 0.14077 [10]
    timing-lighthint5.log:|   sort2 color: 0.143892 [10]
    timing-nohint1.log:|   sort2 color: 0.143003 [10]
    timing-nohint2.log:|   sort2 color: 0.140005 [10]
    timing-nohint3.log:|   sort2 color: 0.142604 [10]
    timing-nohint4.log:|   sort2 color: 0.140307 [10]
    timing-nohint5.log:|   sort2 color: 0.140636 [10]

Abstractions in VT for CUDA

Based on conversations with Matt, he would like as much of the stream management/coordination as possible to come from VT.

  1. Kokkos stream dispatch: the DeviceSpace(stream) and require hint will be passed to Kokkos through the application wrappers we are implementing; the wrappers will check whether a collection is executing, request its stream, and pass it to the parallel_for, etc.

    • The next step here is to merge the wrappers and then use them across the codebase
    • I'm thinking of using the Clang frontend transformer we wrote to insert fences (which already finds the Kokkos::parallel_for, etc.) to apply this transformation uniformly across the codes
    • Matt and Roger would like us to benchmark the change with a branch to make sure it doesn't add undue overhead to the dispatches. I doubt this will be a problem with the granularity we currently have.
  2. General async operations in VT (#1238): implement a general mechanism for tracking a pollable operation in VT that produces/consumes the current epoch when created. This abstraction will be useful for operations that run asynchronously in the background, but should hold up the epoch's termination.

Here's the implementation of AsyncOp:

https://github.com/DARMA-tasking/vt/blob/a9484df71a21113c52907961335b4e00ca01beb1/src/vt/messaging/async_op.h#L52-L99

And the specialization of it for CUDA streams/events:

https://github.com/DARMA-tasking/vt/blob/a9484df71a21113c52907961335b4e00ca01beb1/src/vt/messaging/async_op_cuda.h#L54-L92
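
As a rough sketch of the shape of such a specialization (assuming the AsyncOp base linked above exposes virtual poll()/done() hooks; the class name here is invented):

    struct AsyncOpCUDAEventSketch : vt::messaging::AsyncOp {
      explicit AsyncOpCUDAEventSketch(cudaEvent_t event) : event_(event) {}

      // True once all work recorded before the event has completed.
      bool poll() override { return cudaEventQuery(event_) == cudaSuccess; }

      // Invoked by VT once poll() succeeds; the epoch produced at creation
      // can then be consumed so termination detection may proceed.
      void done() override { cudaEventDestroy(event_); }

    private:
      cudaEvent_t event_ = nullptr;
    };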

The test included in this PR shows how this operation can be used with MPI operations in the background, allowing for even better integration with MPI:

https://github.com/DARMA-tasking/vt/blob/1237-polling-epoch-task/tests/unit/active/test_async_op.cc

  3. Extending collection chain set (no PR yet): based on discussions with Matt, the idea here is to provide an extra cleanup lambda that executes after the async CUDA operations are enqueued by nextStep. Here's a strawman API for the cleanup:

    chains_->nextStep("sort", [=](vt::Index2D idx) {
      return backend_proxy_(idx).template send<NullMsg, &BackendType::sort>();
    }).cleanup([=](vt::Index2D idx) {
      return backend_proxy_(idx).template send<NullMsg, &BackendType::cleanupSort>();
    });

The key observation is that we want to enqueue an action after the sort event completes. However, this operation should not be inserted in the chain. Instead, it's a "hanging dependency" that is part of the outer collective epoch but has no downstream dependencies. The cleanupSort action will attach after the sort operation completes its AsyncOpCUDA.

  4. Storable collection trait (#1236): this PR implements a general mechanism for storing persistent data in a collection. I plan to store streams and device spaces with this mechanism, and I think it's generally useful.

  5. Streamable collection (no PR): this code will probably start in the application and possibly make its way into VT at some point.

    struct StreamWrapper {
      StreamWrapper() {
        auto ret = cudaStreamCreate(&stream_);
        vtAssert(ret == cudaSuccess, "Must be able to create stream");
      }
      StreamWrapper(StreamWrapper const&) = delete;
      StreamWrapper(StreamWrapper&&) = delete;

      ~StreamWrapper() {
        auto ret = cudaStreamDestroy(stream_);
        vtAssert(ret == cudaSuccess, "Must be able to destroy stream");
      }

      cudaStream_t get() const { return stream_; }

    private:
      cudaStream_t stream_ = {};
    };

    template <typename T, typename IdxT, typename DeviceSpaceT>
    struct StreamableCollection : vt::Collection<T, IdxT> {
      StreamableCollection() {
        auto sw = std::make_unique<StreamWrapper>();
        // Grab the raw stream before the unique_ptr is moved into storage.
        auto stream = sw->get();
        this->valInsert("stream", std::move(sw));
        this->valInsert("device", DeviceSpaceT{stream});
      }

      // Rebuild the device space from the stored stream after migrating in,
      // since the execution space instance itself is not migrated.
      void epiMigrateIn() override {
        auto& sw = this->valGet<std::unique_ptr<StreamWrapper>>("stream");
        this->valInsert("device", DeviceSpaceT{sw->get()});
      }
    };

StreamWrapper acts as a proper scoped (RAII) C++ object, so each element's stream is created and destroyed with its owning wrapper.
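
A hypothetical instantiation (assuming Kokkos::Cuda as the device space type; the collection name is invented):

    struct Particles : StreamableCollection<Particles, vt::Index2D, Kokkos::Cuda> {
      // element data and handlers; each element owns its own stream and
      // device space via the storable trait
    };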