Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL][Graph] Add API for updating single node parameters #340

Closed
wants to merge 23 commits into from

Conversation

Bensuo
Copy link
Collaborator

@Bensuo Bensuo commented Nov 1, 2023

  • New API and wording for updating single parameters

This bears some relation to #77 but takes a different approach. The other PR requires that users keep track of both the old parameter and the new parameter when updating, where this one creates an intermediate object (essentially a handle) which can be used when updating the parameters with new values. I believe this is preferable to having to store buffers, pointers and potentially a mix of both in complex code where graph building and updating may not be located in the same scope necessarily.

Edit: Queries have been removed from this PR and are now implemented in #348

@Bensuo Bensuo added the Graph Specification Extension Specification related label Nov 1, 2023
Copy link
Collaborator

@EwanC EwanC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great start, I think we think about how the various update API designs fit together, to work out how much of the helper getDependencies()/getNodes() queries we should be providing with this PR(if any)

  • Whole graph update - record a library call, and don't have access
    to sycl inputs themselves.
  • dynamic parameter update - user doesn't want the runtime to traverse the graph,
    know exactly what nodes to change.
  • [SYCL] Add explicit update API to graph spec #77 update - user has access to sycl inputs but doesn't want
    to manually loop over every node in a graph doing dynamic parameter
    update, so runtime can do that for them. Maybe created the graph using record & replay rather than explicitly.

@EwanC EwanC force-pushed the ben/explicit-update branch 3 times, most recently from abd903e to 44cff2b Compare November 20, 2023 16:19
@EwanC EwanC force-pushed the ben/explicit-update branch from 44cff2b to 9696fe9 Compare November 29, 2023 09:02
guarantee allows the backend to provide a more optimized implementation, if
possible.

The `property::graph::updatable_graph` property also allows an executable graph

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For in order queue, the contiguous submission is always allowed.

For out of order queue, it is not allowed without updatable_graph property, it is possible to allow with updatable_graph property, but, I doubt if the graph update is always enough to remove all the data race conditions, there are lots of intermediate memory within the graph.

imo, we may remove this paragraph which describes the rare case that the update is enough to remove all the data race condition.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For in order queue, the contiguous submission is always allowed.

That's actually not the current spec wording, the spec says "Throws synchronously with error code invalid if a previous submission of graph has yet to complete execution." So even if you use an in-order queue, or set execution dependencies in an out-of-order queue, it will still be an exception. This is very restrictive, which is why we have #345 open. I imagine that #345 will merge first, and the wording here will updated to reflect that (right now it's a bit of a placeholder)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, you are correct for the current status. My option should be more precise: For in order queue, the contiguous submission is expected to always be allowed once the limitation of “host sync for multiple command-buffer submissions” is removed. And hope #345 to be merged soon, this limitation is really so restrictive.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my main idea of this comment is that: even with updatable_graph property, there's still high possibility to have data race condition due to the many intermediate tensors within the graph.


// Register nodeB dynamic parameters
dynParamInput.register(ptrX, nodeB);
dynParamInput.register(ptrY, nodeB);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the concept of the dynamic_parameter type, which acts like a handle to the parameter you want to update. However, I hate the way the registration works -- matching by value seems error prone. Could you structure dynamic_parameter like this instead?

template<typename T>
class dynamic_parameter {
 public:
  dynamic_parameter(const T& param, command_graph<graph_state::modifiable> graph, std::string label = "");
  register(const handler &cgh);
  update(const T& newval, command_graph<graph_state::executable> execGraph);
};

Usage would be like this:

size_t len = /* ... */;
int *ptrx = malloc_shared<int>(len, myQueue);
int *ptry = malloc_device<int>(len, myQueue);

// Note that we use CTAD here to deduce the template parameter of "dyanmic_parameter".
// Also note that the dynamic parameter is associated with the captured kernel argument in the
// constructor, which seems nice.
dynamic_parameter dparam_len{len};
dynamic_parameter dparam_ptrx{ptrX};
dynamic_parameter_dparam_ptry(ptry};

node nodeB = myGraph.add[&](handler& cgh) {
    // Register each dynamic parameter with the "handler".  The implementation use this to figure out the
    // node.  By the time "myGraph.add" returns, each dynamic parameter becomes associated with "nodeB".
    dparam_len.register(cgh);
    dparam_ptrx.register(cgh);
    dparam_ptry.register(cgh);

    cgh.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> it) {
      const size_t i = it[0];
      ptrY[i] += ptrX[i];
    });
   });

auto execGraph = myGraph.finalize({sycl_ext::property::graph::updatable});
myQueue.ext_oneapi_graph(execGraph);

len = /* some new length */;
int *ptrz = malloc_shared<int>(len, myQueue);
int *ptrq = malloc_device<int>(len, myQueue);

// Update parameter to new values.
dparam_len.update(len, execGraph);
dparam_ptrx.update(ptrz, execGraph);
dparam_ptry.update(ptrq, execGraph);

In this proposal, registration is somewhat different. The dynamic_parameter object is constructed with the address of the parameter. The variable passed to the dynamic_parameter constructor must be the same variable that is captured by the kernel's lambda. The call to register then associates the dynamic parameter object with the node and the kernel that the node contains.

I'm assuming here that your graph compiler knows the variables that are captured by the kernel. The graph compiler can search this list of captured variables and compare their address with the address that was passed to the dynamic_parameter constructor. This address is unique, so there's no ambiguity with the registration. You can even create a dynamic_parameter for a scalar as my example shows.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestions. In principal I think having the dynamic_parameter be templated on the arg type is fine, though we had initially avoided this to allow generic storage of dynamic_parameters of a single type. This might present a problem for usability but that's my only concern really.

Registration with the handler would present significant problems with record and replay functionality I believe though, since the code may either be not accessible to make these additions (library case) or not desirable for the user to change (general record and replay case). Record and replay is being used in the primary use-case we have for developing this feature against, so compatibility in this area is very important.

I also believe that scalars still present a problem even with this suggestion. It is my understanding that since these parameters are captured by value inside the lambda, when inspected they will have a different address than the original scalar value. Also in cases where set_arg is used for scalar values then as far as I can see the handler will create it's own copy of the data internally which would have the same issue. Am I misunderstanding something here?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also believe that scalars still present a problem even with this suggestion. It is my understanding that since these parameters are captured by value inside the lambda, when inspected they will have a different address than the original scalar value. Also in cases where set_arg is used for scalar values then as far as I can see the handler will create it's own copy of the data internally which would have the same issue. Am I misunderstanding something here?

My proposal is assuming that you have have access to the address of the original variables that are captured by the kernel lambda. I'm not sure if it's easy for you to get these addresses, but I think it would be worth some extra effort. The main weakness I see with the original proposal is that it identifies kernel arguments by value, but this is error prone because two arguments might have the same value. If you could somehow get the address of the captured variables, you can use this address to uniquely associate each dynamic_parameter with its kernel argument.

I'm not concerned about the set_arg case. In this case, you can uniquely associate each dynamic_parameter with its argument by using the argument index. The main problem I'm trying to solve in this comment is the case when kernel arguments are captured by the lambda expression.

Do you think you can get access to these addresses somehow?

Copy link

@guoyejun guoyejun Dec 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the case if we could not get node from recorded graph, we can use void register(void * param); for USM.

for the case if we can get node from recorded graph, we can use void register(T param, Node node); with limitations.

for the case with explicit building API, can we change a bit to void register(T param, const handler &cgh); as below, and we even don't need to care the address.

dynamic_parameter dp_len(myGraph);
...
node nodeB = myGraph.add[&](handler& cgh) {
    dp_len.register(len, cgh);
...

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the case with explicit building API, can we change a bit to void register(T param, const handler &cgh); as below, and we even don't need to care the address.

This doesn't address my concern. What if the kernel has two parameters that have the same value as len? How will you decide which of those parameters is associated with dp_len?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think dynamic_parameter is orthogonal to buffers and accessors. For example, I think you would still need to use dynamic_parameter if you wanted to change the value of an accessor parameter in a graph. I guess the syntax would look something like:

buffer<int> buf{{N}};
accessor acc{buf};

dynamic_parameter dparam_acc{acc};

node nodeB = myGraph.add[&](handler& cgh) {
    cgh.require(acc);
    dparam_acc.register(cgh);

    cgh.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> it) {
      const size_t i = it[0];
      dparam_acc.get()[i] += i;
    });
   });

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was confused by the first example that only used scalar len variable and USM pointers. Do we expect that "updated" accessor value would have all the same compile-time properties between old/new values (e.g. number of dimensions)?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we expect that "updated" accessor value would have all the same compile-time properties between old/new values (e.g. number of dimensions)?

I assume so, but this is a question to the PR authors.

Assuming the updated accessor does need to have the same compile-time properties, I think the dynamic_parameter would be able to enforce this because those compile-time properties would become part of the type of the dynamic_parameter object. The dynamic_parameter class is templated on the type of the argument, so we could enforce (at compile time) that the updated argument value had the same type (which would include accessor properties).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this let you capture enough information about the parameters, so that you can uniquely identify a parameter without assuming each parameter has a unique value?

Your proposal of a dynamic_parameter instance crossing the host/device code boundary I think should let us capture enough information to uniquely identify each parameter, as the dynamic_parameter instance itself becomes the parameter.

My main concern with this design is that it requires the user to modify the kernel code, which won't work for the record & replay usecase where existing code is wrapped. If I understand @guoyejun's description of how pytorch works, this will be their usage before updating a graph instance. What we could do is aim to provide both registration mechanisms, matching-by-value and the CGH registration, then try to communicate to the user in the spec the drawbacks of the match-by-value approach.

I'm also not sure how long it will take to implement dynamic_parameters as a cross host/device construct, there's
a lot more uncertainty in terms of implementation. E.g. Ensure that when a user has multiple instances of the same graph,
and updates only one of them, then dynamic_parameter.get() has been implemented correctly so that only kernels in that exec graph instance are updated. So having both mechanisms means we could at least prototype match-by-value for evaluation on an estimatable timescale and get feedback on how often match-by-value collisions happen in real workloads.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@EwanC yes, that's right.

to be accurate, PyTorch is not currently using the graph update, but the future refinement in our discussion with pytorch people.

Currently in PyTorch, there are two cases if the graph input USM is changed,

  1. add an extra copy (in pytorch eager mode)
  2. re-record the graph (in torch.compile mode)

// Register nodeB dynamic parameters
dynParamInput.register(ptrX, nodeB);
dynParamInput.register(ptrY, nodeB);

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding another comment here to keep this conversation separately threaded.

I don't understand how dynamic_parameter::update works for an accessor. How would you create the new accessor value? Normally, an accessor is created inside of command group scope, but there is no command group when you update the parameter values. Are you expecting that the user creates a placeholder accessor?

I'm also not clear on how parameter update works for a recorded graph. Is this why you added command_graph::get_nodes and command_graph::get_root_nodes? Is the idea that the application uses these to get specific nodes in the graph and then it figures out the parameters to those nodes by somehow knowing the values of the variables that are captured in that node? This seems to require a lot of knowledge sharing between the code that constructs the dynamic_parameter and the code that submits commands to the queue that is being recorded.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how parameter update works for a recorded graph

for the USM update, see https://github.com/reble/llvm/pull/340/files#r1422564811

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I saw that comment. Updating a USM pointer seems clear to me. My question is specifically about updating an accessor.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand how dynamic_parameter::update works for an accessor. How would you create the new accessor value? Normally, an accessor is created inside of command group scope, but there is no command group when you update the parameter values. Are you expecting that the user creates a placeholder accessor?

Yes as it stands I think this would require placeholder accessors. An alternative could be to instead use the actual buffer when updating the parameter, so in the kernel the accessor would have the same properties but point to updated memory. Accessors may allow more flexibility in terms of updating the range or offset, but it's not clear to me how much that would be useful - it seems like most use cases would want the access to stay the same but simply swap out the actual memory between executions. However, changing offsets could be achieved with sub-buffers potentially which would restore some flexibility.

I'm also not clear on how parameter update works for a recorded graph. Is this why you added command_graph::get_nodes and command_graph::get_root_nodes? Is the idea that the application uses these to get specific nodes in the graph and then it figures out the parameters to those nodes by somehow knowing the values of the variables that are captured in that node? This seems to require a lot of knowledge sharing between the code that constructs the dynamic_parameter and the code that submits commands to the queue that is being recorded.

This is a partial motivation yes, and it does require some knowledge sharing as you say. In the kinds of cases we are looking at though where memory is provided by the user to a library to perform some operations that knowledge is already present in the user code, and the queries allow the user access to what was recorded to be able to update those inputs.

More broadly though these queries enable better mixed record & replay/explicit API usage by allowing nodes which have been recorded to be used as dependencies in the explicit API. Previously this mixed functionality was possible but only in a very limited way.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also curious about how the accessor is implemented, is it a struct that contains size, offset, format (for image), memory handle (to buffer/image) etc.?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks that we need to add command_graph::get_nodes back, in framework, it is used to check if the recorded graph is empty, see at https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cuda/CUDAGraph.cpp#L227

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Support for query functions like get_nodes can be found in a separate PR now: #348

EwanC added a commit to Bensuo/unified-runtime that referenced this pull request Jan 5, 2024
This change introduces a new API that allows the kernel commands of a command-buffer to be updated
with a new configuration. For example, modified arguments or ND-Range.

The new API is defined in the following files and then source generated using scripts, so reviewers should look at:
* `scripts/core/EXP-COMMAND-BUFFER.rst`
* `scripts/core/exp-command-buffer.yml`

See [cl_khr_command_buffer_mutable_dispatch](https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_Ext.html#cl_khr_command_buffer_mutable_dispatch) as prior art. The differences between the proposed API and the above are:

* Only the append kernel entry-point returns a command handle. I imagine this will be changed in future to enable other commands to do update.
* USM,  buffer, and scalar arguments can be updated, there is not equivalent update struct for `urKernelSetArgLocal` or `urKernelSetArgSampler`
* There is no granularity of optional support for update, an implementer must either implement all the ways to update a kernel configuration, or none of them.

The CUDA adapter is the only adapter that currently implements this new feature, other adapters don't report support. This is because CUDA is already an adapter supported by UR command-buffers, and the CUDA API for updating nodes already exists as a non-optional feature.

Reviewers should review the changes in `source/adapters/cuda/` to evaluate this,

CTS tests are written to verify implementation, as there is not yet a DPC++ feature with testing to stress the code path (see reble/llvm#340 for how that feature could look).

A new test directory has been created to test the command-buffer experimental feature, `test/conformance/exp_command_buffer`, which contains tests to stress using the feature defined by this extension so that it has code coverage. Reviewers should look at the new tests added here, and new device kernels in `test/conformance/device_code` to evaluate these changes.
aarongreig pushed a commit to aarongreig/unified-runtime that referenced this pull request Jan 8, 2024
This change introduces a new API that allows the kernel commands of a command-buffer to be updated
with a new configuration. For example, modified arguments or ND-Range.

The new API is defined in the following files and then source generated using scripts, so reviewers should look at:
* `scripts/core/EXP-COMMAND-BUFFER.rst`
* `scripts/core/exp-command-buffer.yml`

See [cl_khr_command_buffer_mutable_dispatch](https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_Ext.html#cl_khr_command_buffer_mutable_dispatch) as prior art. The differences between the proposed API and the above are:

* Only the append kernel entry-point returns a command handle. I imagine this will be changed in future to enable other commands to do update.
* USM,  buffer, and scalar arguments can be updated, there is not equivalent update struct for `urKernelSetArgLocal` or `urKernelSetArgSampler`
* There is no granularity of optional support for update, an implementer must either implement all the ways to update a kernel configuration, or none of them.

The CUDA adapter is the only adapter that currently implements this new feature, other adapters don't report support. This is because CUDA is already an adapter supported by UR command-buffers, and the CUDA API for updating nodes already exists as a non-optional feature.

Reviewers should review the changes in `source/adapters/cuda/` to evaluate this,

CTS tests are written to verify implementation, as there is not yet a DPC++ feature with testing to stress the code path (see reble/llvm#340 for how that feature could look).

A new test directory has been created to test the command-buffer experimental feature, `test/conformance/exp_command_buffer`, which contains tests to stress using the feature defined by this extension so that it has code coverage. Reviewers should look at the new tests added here, and new device kernels in `test/conformance/device_code` to evaluate these changes.
EwanC added a commit to Bensuo/unified-runtime that referenced this pull request Jan 11, 2024
This change introduces a new API that allows the kernel commands of a command-buffer to be updated
with a new configuration. For example, modified arguments or ND-Range.

The new API is defined in the following files and then source generated using scripts, so reviewers should look at:
* `scripts/core/EXP-COMMAND-BUFFER.rst`
* `scripts/core/exp-command-buffer.yml`

See [cl_khr_command_buffer_mutable_dispatch](https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_Ext.html#cl_khr_command_buffer_mutable_dispatch) as prior art. The differences between the proposed API and the above are:

* Only the append kernel entry-point returns a command handle. I imagine this will be changed in future to enable other commands to do update.
* USM,  buffer, and scalar arguments can be updated, there is not equivalent update struct for `urKernelSetArgLocal` or `urKernelSetArgSampler`
* There is no granularity of optional support for update, an implementer must either implement all the ways to update a kernel configuration, or none of them.

The CUDA adapter is the only adapter that currently implements this new feature, other adapters don't report support. This is because CUDA is already an adapter supported by UR command-buffers, and the CUDA API for updating nodes already exists as a non-optional feature.

Reviewers should review the changes in `source/adapters/cuda/` to evaluate this,

CTS tests are written to verify implementation, as there is not yet a DPC++ feature with testing to stress the code path (see reble/llvm#340 for how that feature could look).

A new test directory has been created to test the command-buffer experimental feature, `test/conformance/exp_command_buffer`, which contains tests to stress using the feature defined by this extension so that it has code coverage. Reviewers should look at the new tests added here, and new device kernels in `test/conformance/device_code` to evaluate these changes.
EwanC added a commit to Bensuo/unified-runtime that referenced this pull request Jan 12, 2024
This change introduces a new API that allows the kernel commands of a command-buffer to be updated
with a new configuration. For example, modified arguments or ND-Range.

The new API is defined in the following files and then source generated using scripts, so reviewers should look at:
* `scripts/core/EXP-COMMAND-BUFFER.rst`
* `scripts/core/exp-command-buffer.yml`

See [cl_khr_command_buffer_mutable_dispatch](https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_Ext.html#cl_khr_command_buffer_mutable_dispatch) as prior art. The differences between the proposed API and the above are:

* Only the append kernel entry-point returns a command handle. I imagine this will be changed in future to enable other commands to do update.
* USM,  buffer, and scalar arguments can be updated, there is not equivalent update struct for `urKernelSetArgLocal` or `urKernelSetArgSampler`
* There is no granularity of optional support for update, an implementer must either implement all the ways to update a kernel configuration, or none of them.

The CUDA adapter is the only adapter that currently implements this new feature, other adapters don't report support. This is because CUDA is already an adapter supported by UR command-buffers, and the CUDA API for updating nodes already exists as a non-optional feature.

Reviewers should review the changes in `source/adapters/cuda/` to evaluate this,

CTS tests are written to verify implementation, as there is not yet a DPC++ feature with testing to stress the code path (see reble/llvm#340 for how that feature could look).

A new test directory has been created to test the command-buffer experimental feature, `test/conformance/exp_command_buffer`, which contains tests to stress using the feature defined by this extension so that it has code coverage. Reviewers should look at the new tests added here, and new device kernels in `test/conformance/device_code` to evaluate these changes.
EwanC added a commit to Bensuo/unified-runtime that referenced this pull request Jan 15, 2024
This change introduces a new API that allows the kernel commands of a command-buffer to be updated
with a new configuration. For example, modified arguments or ND-Range.

The new API is defined in the following files and then source generated using scripts, so reviewers should look at:
* `scripts/core/EXP-COMMAND-BUFFER.rst`
* `scripts/core/exp-command-buffer.yml`

See [cl_khr_command_buffer_mutable_dispatch](https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_Ext.html#cl_khr_command_buffer_mutable_dispatch) as prior art. The differences between the proposed API and the above are:

* Only the append kernel entry-point returns a command handle. I imagine this will be changed in future to enable other commands to do update.
* USM,  buffer, and scalar arguments can be updated, there is not equivalent update struct for `urKernelSetArgLocal` or `urKernelSetArgSampler`
* There is no granularity of optional support for update, an implementer must either implement all the ways to update a kernel configuration, or none of them.

The CUDA adapter is the only adapter that currently implements this new feature, other adapters don't report support. This is because CUDA is already an adapter supported by UR command-buffers, and the CUDA API for updating nodes already exists as a non-optional feature.

Reviewers should review the changes in `source/adapters/cuda/` to evaluate this,

CTS tests are written to verify implementation, as there is not yet a DPC++ feature with testing to stress the code path (see reble/llvm#340 for how that feature could look).

A new test directory has been created to test the command-buffer experimental feature, `test/conformance/exp_command_buffer`, which contains tests to stress using the feature defined by this extension so that it has code coverage. Reviewers should look at the new tests added here, and new device kernels in `test/conformance/device_code` to evaluate these changes.
EwanC added a commit to Bensuo/unified-runtime that referenced this pull request Jan 17, 2024
This change introduces a new API that allows the kernel commands of a command-buffer to be updated
with a new configuration. For example, modified arguments or ND-Range.

The new API is defined in the following files and then source generated using scripts, so reviewers should look at:
* `scripts/core/EXP-COMMAND-BUFFER.rst`
* `scripts/core/exp-command-buffer.yml`

See [cl_khr_command_buffer_mutable_dispatch](https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_Ext.html#cl_khr_command_buffer_mutable_dispatch) as prior art. The differences between the proposed API and the above are:

* Only the append kernel entry-point returns a command handle. I imagine this will be changed in future to enable other commands to do update.
* USM,  buffer, and scalar arguments can be updated, there is not equivalent update struct for `urKernelSetArgLocal` or `urKernelSetArgSampler`
* There is no granularity of optional support for update, an implementer must either implement all the ways to update a kernel configuration, or none of them.

The CUDA adapter is the only adapter that currently implements this new feature, other adapters don't report support. This is because CUDA is already an adapter supported by UR command-buffers, and the CUDA API for updating nodes already exists as a non-optional feature.

Reviewers should review the changes in `source/adapters/cuda/` to evaluate this,

CTS tests are written to verify implementation, as there is not yet a DPC++ feature with testing to stress the code path (see reble/llvm#340 for how that feature could look).

A new test directory has been created to test the command-buffer experimental feature, `test/conformance/exp_command_buffer`, which contains tests to stress using the feature defined by this extension so that it has code coverage. Reviewers should look at the new tests added here, and new device kernels in `test/conformance/device_code` to evaluate these changes.

// Register nodeA dynamic parameters
dynParamInput.register(0, nodeA); // Argument index 0 is ptrX
dynParamInput.register(1, nodeA); // Argument index 1 is myScalar
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
dynParamInput.register(1, nodeA); // Argument index 1 is myScalar
dynParamScalar.register(1, nodeA); // Argument index 1 is myScalar


// Register nodeB dynamic parameters
dynParamInput.register(ptrX, nodeB);
dynParamInput.register(ptrY, nodeB);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
dynParamInput.register(ptrY, nodeB);
dynParamOutput.register(ptrY, nodeB);

EwanC added a commit to Bensuo/unified-runtime that referenced this pull request Jan 18, 2024
This change introduces a new API that allows the kernel commands of a command-buffer to be updated
with a new configuration. For example, modified arguments or ND-Range.

The new API is defined in the following files and then source generated using scripts, so reviewers should look at:
* `scripts/core/EXP-COMMAND-BUFFER.rst`
* `scripts/core/exp-command-buffer.yml`

See [cl_khr_command_buffer_mutable_dispatch](https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_Ext.html#cl_khr_command_buffer_mutable_dispatch) as prior art. The differences between the proposed API and the above are:

* Only the append kernel entry-point returns a command handle. I imagine this will be changed in future to enable other commands to do update.
* USM,  buffer, and scalar arguments can be updated, there is not equivalent update struct for `urKernelSetArgLocal` or `urKernelSetArgSampler`
* There is no granularity of optional support for update, an implementer must either implement all the ways to update a kernel configuration, or none of them.

The CUDA adapter is the only adapter that currently implements this new feature, other adapters don't report support. This is because CUDA is already an adapter supported by UR command-buffers, and the CUDA API for updating nodes already exists as a non-optional feature.

Reviewers should review the changes in `source/adapters/cuda/` to evaluate this,

CTS tests are written to verify implementation, as there is not yet a DPC++ feature with testing to stress the code path (see reble/llvm#340 for how that feature could look).

A new test directory has been created to test the command-buffer experimental feature, `test/conformance/exp_command_buffer`, which contains tests to stress using the feature defined by this extension so that it has code coverage. Reviewers should look at the new tests added here, and new device kernels in `test/conformance/device_code` to evaluate these changes.
@Bensuo
Copy link
Collaborator Author

Bensuo commented Jan 24, 2024

Closing in favour of upstream PR: intel#12486

@Bensuo Bensuo closed this Jan 24, 2024
EwanC added a commit to Bensuo/unified-runtime that referenced this pull request Feb 1, 2024
This change introduces a new API that allows the kernel commands of a command-buffer to be updated
with a new configuration. For example, modified arguments or ND-Range.

The new API is defined in the following files and then source generated using scripts, so reviewers should look at:
* `scripts/core/EXP-COMMAND-BUFFER.rst`
* `scripts/core/exp-command-buffer.yml`

See [cl_khr_command_buffer_mutable_dispatch](https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_Ext.html#cl_khr_command_buffer_mutable_dispatch) as prior art. The differences between the proposed API and the above are:

* Only the append kernel entry-point returns a command handle. I imagine this will be changed in future to enable other commands to do update.
* USM,  buffer, and scalar arguments can be updated, there is not equivalent update struct for `urKernelSetArgLocal` or `urKernelSetArgSampler`
* There is no granularity of optional support for update, an implementer must either implement all the ways to update a kernel configuration, or none of them.

The CUDA adapter is the only adapter that currently implements this new feature, other adapters don't report support. This is because CUDA is already an adapter supported by UR command-buffers, and the CUDA API for updating nodes already exists as a non-optional feature.

Reviewers should review the changes in `source/adapters/cuda/` to evaluate this,

CTS tests are written to verify implementation, as there is not yet a DPC++ feature with testing to stress the code path (see reble/llvm#340 for how that feature could look).

A new test directory has been created to test the command-buffer experimental feature, `test/conformance/exp_command_buffer`, which contains tests to stress using the feature defined by this extension so that it has code coverage. Reviewers should look at the new tests added here, and new device kernels in `test/conformance/device_code` to evaluate these changes.
EwanC added a commit to Bensuo/unified-runtime that referenced this pull request Feb 5, 2024
This change introduces a new API that allows the kernel commands of a command-buffer to be updated
with a new configuration. For example, modified arguments or ND-Range.

The new API is defined in the following files and then source generated using scripts, so reviewers should look at:
* `scripts/core/EXP-COMMAND-BUFFER.rst`
* `scripts/core/exp-command-buffer.yml`

See [cl_khr_command_buffer_mutable_dispatch](https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_Ext.html#cl_khr_command_buffer_mutable_dispatch) as prior art. The differences between the proposed API and the above are:

* Only the append kernel entry-point returns a command handle. I imagine this will be changed in future to enable other commands to do update.
* USM,  buffer, and scalar arguments can be updated, there is not equivalent update struct for `urKernelSetArgLocal` or `urKernelSetArgSampler`
* There is no granularity of optional support for update, an implementer must either implement all the ways to update a kernel configuration, or none of them.
* Command-handles are reference counted in UR, and extend the lifetime of the parent command-buffer.

The CUDA adapter is the only adapter that currently implements this new feature, other adapters don't report support. This is because CUDA is already an adapter supported by UR command-buffers, and the CUDA API for updating nodes already exists as a non-optional feature.

Reviewers should review the changes in `source/adapters/cuda/` to evaluate this,

CTS tests are written to verify implementation, as there is not yet a DPC++ feature with testing to stress the code path (see reble/llvm#340 for how that feature could look).

A new test directory has been created to test the command-buffer experimental feature, `test/conformance/exp_command_buffer`, which contains tests to stress using the feature defined by this extension so that it has code coverage. Reviewers should look at the new tests added here, and new device kernels in `test/conformance/device_code` to evaluate these changes.
EwanC added a commit to Bensuo/unified-runtime that referenced this pull request Feb 5, 2024
This change introduces a new API that allows the kernel commands of a command-buffer to be updated
with a new configuration. For example, modified arguments or ND-Range.

The new API is defined in the following files and then source generated using scripts, so reviewers should look at:
* `scripts/core/EXP-COMMAND-BUFFER.rst`
* `scripts/core/exp-command-buffer.yml`

See [cl_khr_command_buffer_mutable_dispatch](https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_Ext.html#cl_khr_command_buffer_mutable_dispatch) as prior art. The differences between the proposed API and the above are:

* Only the append kernel entry-point returns a command handle. I imagine this will be changed in future to enable other commands to do update.
* USM,  buffer, and scalar arguments can be updated, there is not equivalent update struct for `urKernelSetArgLocal` or `urKernelSetArgSampler`
* There is no granularity of optional support for update, an implementer must either implement all the ways to update a kernel configuration, or none of them.
* Command-handles are reference counted in UR, and extend the lifetime of the parent command-buffer.

The CUDA adapter is the only adapter that currently implements this new feature, other adapters don't report support. This is because CUDA is already an adapter supported by UR command-buffers, and the CUDA API for updating nodes already exists as a non-optional feature.

Reviewers should review the changes in `source/adapters/cuda/` to evaluate this,

CTS tests are written to verify implementation, as there is not yet a DPC++ feature with testing to stress the code path (see reble/llvm#340 for how that feature could look).

A new test directory has been created to test the command-buffer experimental feature, `test/conformance/exp_command_buffer`, which contains tests to stress using the feature defined by this extension so that it has code coverage. Reviewers should look at the new tests added here, and new device kernels in `test/conformance/device_code` to evaluate these changes.
EwanC added a commit to Bensuo/unified-runtime that referenced this pull request Feb 7, 2024
This change introduces a new API that allows the kernel commands of a command-buffer to be updated
with a new configuration. For example, modified arguments or ND-Range.

The new API is defined in the following files and then source generated using scripts, so reviewers should look at:
* `scripts/core/EXP-COMMAND-BUFFER.rst`
* `scripts/core/exp-command-buffer.yml`

See [cl_khr_command_buffer_mutable_dispatch](https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_Ext.html#cl_khr_command_buffer_mutable_dispatch) as prior art. The differences between the proposed API and the above are:

* Only the append kernel entry-point returns a command handle. I imagine this will be changed in future to enable other commands to do update.
* USM,  buffer, and scalar arguments can be updated, there is not equivalent update struct for `urKernelSetArgLocal` or `urKernelSetArgSampler`
* There is no granularity of optional support for update, an implementer must either implement all the ways to update a kernel configuration, or none of them.
* Command-handles are reference counted in UR, and extend the lifetime of the parent command-buffer.

The CUDA adapter is the only adapter that currently implements this new feature, other adapters don't report support. This is because CUDA is already an adapter supported by UR command-buffers, and the CUDA API for updating nodes already exists as a non-optional feature.

Reviewers should review the changes in `source/adapters/cuda/` to evaluate this,

CTS tests are written to verify implementation, as there is not yet a DPC++ feature with testing to stress the code path (see reble/llvm#340 for how that feature could look).

A new test directory has been created to test the command-buffer experimental feature, `test/conformance/exp_command_buffer`, which contains tests to stress using the feature defined by this extension so that it has code coverage. Reviewers should look at the new tests added here, and new device kernels in `test/conformance/device_code` to evaluate these changes.
EwanC added a commit to Bensuo/unified-runtime that referenced this pull request Feb 7, 2024
This change introduces a new API that allows the kernel commands of a command-buffer to be updated
with a new configuration. For example, modified arguments or ND-Range.

The new API is defined in the following files and then source generated using scripts, so reviewers should look at:
* `scripts/core/EXP-COMMAND-BUFFER.rst`
* `scripts/core/exp-command-buffer.yml`

See [cl_khr_command_buffer_mutable_dispatch](https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_Ext.html#cl_khr_command_buffer_mutable_dispatch) as prior art. The differences between the proposed API and the above are:

* Only the append kernel entry-point returns a command handle. I imagine this will be changed in future to enable other commands to do update.
* USM,  buffer, and scalar arguments can be updated, there is not equivalent update struct for `urKernelSetArgLocal` or `urKernelSetArgSampler`
* There is no granularity of optional support for update, an implementer must either implement all the ways to update a kernel configuration, or none of them.
* Command-handles are reference counted in UR, and extend the lifetime of the parent command-buffer.

The CUDA adapter is the only adapter that currently implements this new feature, other adapters don't report support. This is because CUDA is already an adapter supported by UR command-buffers, and the CUDA API for updating nodes already exists as a non-optional feature.

Reviewers should review the changes in `source/adapters/cuda/` to evaluate this,

CTS tests are written to verify implementation, as there is not yet a DPC++ feature with testing to stress the code path (see reble/llvm#340 for how that feature could look).

A new test directory has been created to test the command-buffer experimental feature, `test/conformance/exp_command_buffer`, which contains tests to stress using the feature defined by this extension so that it has code coverage. Reviewers should look at the new tests added here, and new device kernels in `test/conformance/device_code` to evaluate these changes.
EwanC added a commit to Bensuo/unified-runtime that referenced this pull request Feb 8, 2024
This change introduces a new API that allows the kernel commands of a command-buffer to be updated
with a new configuration. For example, modified arguments or ND-Range.

The new API is defined in the following files and then source generated using scripts, so reviewers should look at:
* `scripts/core/EXP-COMMAND-BUFFER.rst`
* `scripts/core/exp-command-buffer.yml`

See [cl_khr_command_buffer_mutable_dispatch](https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_Ext.html#cl_khr_command_buffer_mutable_dispatch) as prior art. The differences between the proposed API and the above are:

* Only the append kernel entry-point returns a command handle. I imagine this will be changed in future to enable other commands to do update.
* USM,  buffer, and scalar arguments can be updated, there is not equivalent update struct for `urKernelSetArgLocal` or `urKernelSetArgSampler`
* There is no granularity of optional support for update, an implementer must either implement all the ways to update a kernel configuration, or none of them.
* Command-handles are reference counted in UR, and extend the lifetime of the parent command-buffer.

The CUDA adapter is the only adapter that currently implements this new feature, other adapters don't report support. This is because CUDA is already an adapter supported by UR command-buffers, and the CUDA API for updating nodes already exists as a non-optional feature.

Reviewers should review the changes in `source/adapters/cuda/` to evaluate this,

CTS tests are written to verify implementation, as there is not yet a DPC++ feature with testing to stress the code path (see reble/llvm#340 for how that feature could look).

A new test directory has been created to test the command-buffer experimental feature, `test/conformance/exp_command_buffer`, which contains tests to stress using the feature defined by this extension so that it has code coverage. Reviewers should look at the new tests added here, and new device kernels in `test/conformance/device_code` to evaluate these changes.
EwanC added a commit to Bensuo/unified-runtime that referenced this pull request Feb 9, 2024
This change introduces a new API that allows the kernel commands of a command-buffer to be updated
with a new configuration. For example, modified arguments or ND-Range.

The new API is defined in the following files and then source generated using scripts, so reviewers should look at:
* `scripts/core/EXP-COMMAND-BUFFER.rst`
* `scripts/core/exp-command-buffer.yml`

See [cl_khr_command_buffer_mutable_dispatch](https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_Ext.html#cl_khr_command_buffer_mutable_dispatch) as prior art. The differences between the proposed API and the above are:

* Only the append kernel entry-point returns a command handle. I imagine this will be changed in future to enable other commands to do update.
* USM,  buffer, and scalar arguments can be updated, there is not equivalent update struct for `urKernelSetArgLocal` or `urKernelSetArgSampler`
* There is no granularity of optional support for update, an implementer must either implement all the ways to update a kernel configuration, or none of them.
* Command-handles are reference counted in UR, and extend the lifetime of the parent command-buffer.

The CUDA adapter is the only adapter that currently implements this new feature, other adapters don't report support. This is because CUDA is already an adapter supported by UR command-buffers, and the CUDA API for updating nodes already exists as a non-optional feature.

Reviewers should review the changes in `source/adapters/cuda/` to evaluate this,

CTS tests are written to verify implementation, as there is not yet a DPC++ feature with testing to stress the code path (see reble/llvm#340 for how that feature could look).

A new test directory has been created to test the command-buffer experimental feature, `test/conformance/exp_command_buffer`, which contains tests to stress using the feature defined by this extension so that it has code coverage. Reviewers should look at the new tests added here, and new device kernels in `test/conformance/device_code` to evaluate these changes.
Bensuo pushed a commit to Bensuo/unified-runtime that referenced this pull request Feb 12, 2024
This change introduces a new API that allows the kernel commands of a command-buffer to be updated
with a new configuration. For example, modified arguments or ND-Range.

The new API is defined in the following files and then source generated using scripts, so reviewers should look at:
* `scripts/core/EXP-COMMAND-BUFFER.rst`
* `scripts/core/exp-command-buffer.yml`

See [cl_khr_command_buffer_mutable_dispatch](https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_Ext.html#cl_khr_command_buffer_mutable_dispatch) as prior art. The differences between the proposed API and the above are:

* Only the append kernel entry-point returns a command handle. I imagine this will be changed in future to enable other commands to do update.
* USM,  buffer, and scalar arguments can be updated, there is not equivalent update struct for `urKernelSetArgLocal` or `urKernelSetArgSampler`
* There is no granularity of optional support for update, an implementer must either implement all the ways to update a kernel configuration, or none of them.
* Command-handles are reference counted in UR, and extend the lifetime of the parent command-buffer.

The CUDA adapter is the only adapter that currently implements this new feature, other adapters don't report support. This is because CUDA is already an adapter supported by UR command-buffers, and the CUDA API for updating nodes already exists as a non-optional feature.

Reviewers should review the changes in `source/adapters/cuda/` to evaluate this,

CTS tests are written to verify implementation, as there is not yet a DPC++ feature with testing to stress the code path (see reble/llvm#340 for how that feature could look).

A new test directory has been created to test the command-buffer experimental feature, `test/conformance/exp_command_buffer`, which contains tests to stress using the feature defined by this extension so that it has code coverage. Reviewers should look at the new tests added here, and new device kernels in `test/conformance/device_code` to evaluate these changes.
EwanC added a commit to Bensuo/unified-runtime that referenced this pull request Feb 13, 2024
This change introduces a new API that allows the kernel commands of a command-buffer to be updated
with a new configuration. For example, modified arguments or ND-Range.

The new API is defined in the following files and then source generated using scripts, so reviewers should look at:
* `scripts/core/EXP-COMMAND-BUFFER.rst`
* `scripts/core/exp-command-buffer.yml`

See [cl_khr_command_buffer_mutable_dispatch](https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_Ext.html#cl_khr_command_buffer_mutable_dispatch) as prior art. The differences between the proposed API and the above are:

* Only the append kernel entry-point returns a command handle. I imagine this will be changed in future to enable other commands to do update.
* USM,  buffer, and scalar arguments can be updated, there is not equivalent update struct for `urKernelSetArgLocal` or `urKernelSetArgSampler`
* There is no granularity of optional support for update, an implementer must either implement all the ways to update a kernel configuration, or none of them.
* Command-handles are reference counted in UR, and extend the lifetime of the parent command-buffer.

The CUDA adapter is the only adapter that currently implements this new feature, other adapters don't report support. This is because CUDA is already an adapter supported by UR command-buffers, and the CUDA API for updating nodes already exists as a non-optional feature.

Reviewers should review the changes in `source/adapters/cuda/` to evaluate this,

CTS tests are written to verify implementation, as there is not yet a DPC++ feature with testing to stress the code path (see reble/llvm#340 for how that feature could look).

A new test directory has been created to test the command-buffer experimental feature, `test/conformance/exp_command_buffer`, which contains tests to stress using the feature defined by this extension so that it has code coverage. Reviewers should look at the new tests added here, and new device kernels in `test/conformance/device_code` to evaluate these changes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Graph Specification Extension Specification related
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants