Unify with Codeplay Graph extension. #4

Merged
merged 11 commits into from
Nov 11, 2022
149 changes: 95 additions & 54 deletions sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc
@@ -88,9 +88,9 @@ As well as benefits to the SYCL runtime, there are also advantages to the user
developing SYCL applications, as repetitive workloads no longer have to
redundantly issue the same sequence of commands. Instead, a graph is only
constructed once and submitted for execution as many times as is necessary, only
changing the data in input buffers or USM allocations. For machine learning
applications where the same command group pattern is run repeatedly for
different inputs, this is particularly useful.
changing the data in input buffers or USM allocations. For applications from
specific domains, such as machine learning, where the same command group pattern
is run repeatedly for different inputs, this is particularly useful.

=== Requirements

@@ -109,25 +109,23 @@ requirements were considered:
built-in kernels.
7. Ability to record a graph with commands submitted to different devices in the
same context.
8. A graph constructed using a device queue may be executed on another compatible
queue.
9. Capability to serialize graphs to a binary format which can then be
8. Capability to serialize graphs to a binary format which can then be
de-serialized and executed. This is helpful for offline cases where a graph
can be created by an offline tool to be loaded and run without the end-user
incurring the overheads of graph creation.
10. Backend interoperability, the ability to retrieve a native graph object from
9. Backend interoperability, the ability to retrieve a native graph object from
the graph and use that in a native backend API.

To allow for prototype implementations of this extension to be developed
quickly for evaluation the scope of this proposal was limited to a subset
of these requirements. In particular, the serialization functionality (9),
backend interoperability (10), and a profiling/debugging interface (3) were
of these requirements. In particular, the serialization functionality (8),
backend interoperability (9), and a profiling/debugging interface (3) were
omitted, as these are not easy to abstract over a number of backends without
significant investigation. It is also hoped these features can be exposed as
additive changes to the API, and so introduced in future versions of the
extension.

Another reason for deferring a serialize/deserialize API (9) is that its scope
Another reason for deferring a serialize/deserialize API (8) is that its scope
could extend from emitting the graph in a binary format, to emitting a
standardized IR format that enables further device specific graph optimizations.

@@ -150,9 +148,15 @@ data dependencies of the command group.
Each of these mechanisms for constructing a graph has its own advantages, so
having both APIs available allows the user to pick the one which is most
suitable for them. The queue recording API allows quicker porting of existing
applications, and can capture work done by a library in the graph. While the
explicit API can better express what data is internal to the graph for
optimization, and dependencies don't need to be inferred.
applications, and can capture external work that is submitted to a queue, for
example via library function calls. While the explicit API can better express
what data is internal to the graph for optimization, and dependencies don't need
to be inferred.

It is valid to combine these two mechanisms sequentially when constructing a
graph, however it is not valid to concurrently use them. An error will be thrown
if a user attempts to use the explicit API to add a node to a graph which is
being recorded to by a queue.
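
The rule above can be illustrated with a small stand-alone model. This is only a sketch: `toy_graph`, its `recording` flag, `record_submission()`, and `explicit_add_rejected()` are hypothetical names invented for illustration and are not part of the extension, and `std::runtime_error` stands in for the SYCL exception with error code `invalid`.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a modifiable graph; not the extension's class.
struct toy_graph {
  bool recording = false;   // true while a queue is recording to this graph
  std::vector<int> nodes;   // node payloads, for illustration only

  // Models the explicit API: adding a node is rejected during recording.
  std::size_t add_node(int payload) {
    if (recording)
      throw std::runtime_error("invalid: a queue is recording to this graph");
    nodes.push_back(payload);
    return nodes.size() - 1;
  }

  // Models the recording path appending a captured submission.
  std::size_t record_submission(int payload) {
    assert(recording && "only valid while a queue is recording");
    nodes.push_back(payload);
    return nodes.size() - 1;
  }
};

// Returns true if the explicit call was rejected with an exception.
inline bool explicit_add_rejected(toy_graph &g, int payload) {
  try {
    g.add_node(payload);
  } catch (const std::runtime_error &) {
    return true;
  }
  return false;
}
```

In this model the two mechanisms can alternate on the same graph (explicit, then recorded, then explicit again), but an explicit call made while `recording` is set is refused, which mirrors the sequential-but-not-concurrent wording above.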

== Specification

@@ -183,43 +187,68 @@ Table 2. Terminology.
| Concept | Description

| Graph
| `command_graph` class that stores structured commands and their dependencies.

A SYCL graph is a collection of commands (nodes) and their dependencies (edges).
From the SYCL perspective, this graph will be acyclic and directed (DAG) as
users cannot express a cycle in the core SYCL API.
| A directed and acyclic graph (DAG) of commands (nodes) and their dependencies
(edges), represented by the `command_graph` class.

| Node
| A command, which can have different attributes.

When recording a queue to construct a graph, nodes in a SYCL graph represent
each of the command group submissions of the program. Each submission
encompasses either one or both of a.) some data movement, b.) a single
asynchronous kernel launch. Nodes cannot define forward edges, only backwards
(i.e. kernels can only create dependencies on things that have already
happened). This means that transparently a node can depend on a previously
recorded graph (sub-graph), which works by creating edges to the individual nodes
in the old graph. Explicit memory operations without kernels, such as a memory
copy, are still classed as nodes under this definition, as the
{explicit-memory-ops}[SYCL 2020 specification states] that these can be seen as
specialized kernels executing on the device.

In the explicit graph building API, nodes can also represent a memory allocation/free
operation on the device.

| Edge
| Dependency between commands as a happens-before relationship.

When recording a queue to construct a graph, an edge in the SYCL graph represents
a data dependency between two nodes. These dependencies are expressed by the user
code through buffer accessors. There is also the partial ability to track USM
data dependencies provided the pointers used in the graph nodes are the same.
With the limitation that a node taking an offset USM pointer input will not be
identified as having an edge to another node taking a pointer input to the base
address of the same USM allocation.
|===

==== Explicit Graph Building API

When using the explicit graph building API to construct a graph, nodes and
edges are captured as follows.

Table 3. Explicit Graph Definition.
[%header,cols="1,3"]
|===
| Concept | Description

| Node
| In the explicit graph building API nodes are created by the user invoking
methods on a modifiable graph. Each node represents either a command-group
function, empty operation, or device memory allocation/free.

| Edge
| In the explicit graph building API edges are defined by the user. This is
either through buffer accessors, the `make_edge()` free function, or by passing
dependent nodes on creation of a new node.
|===
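
As a rough sketch of two of the edge mechanisms listed in Table 3 (passing dependent nodes on creation, and a `make_edge()` free function), the following model uses plain indices as node handles. `toy_dag`, `depends_on()`, and the index-based signatures are assumptions made for illustration; only the `make_edge` name and the `sender`/`receiver` parameter names come from the proposal text.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of an explicitly built graph; not the extension's class.
struct toy_dag {
  // preds[i] lists the nodes that must complete before node i.
  std::vector<std::vector<std::size_t>> preds;

  // Adds a node, optionally passing the nodes it depends on at creation.
  std::size_t add_node(const std::vector<std::size_t> &deps = {}) {
    for (std::size_t d : deps)
      assert(d < preds.size() && "dependencies must already exist");
    preds.push_back(deps);
    return preds.size() - 1;
  }
};

// Illustrative analogue of make_edge(): receiver happens after sender.
inline void make_edge(toy_dag &g, std::size_t sender, std::size_t receiver) {
  assert(sender < g.preds.size() && receiver < g.preds.size());
  g.preds[receiver].push_back(sender);
}

// True if `node` has a direct edge from `dep`.
inline bool depends_on(const toy_dag &g, std::size_t node, std::size_t dep) {
  for (std::size_t p : g.preds[node])
    if (p == dep)
      return true;
  return false;
}
```

Edges defined through buffer accessors are not modelled here, since they depend on the runtime's accessor tracking.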

==== Queue Recording API

When using the record & replay API to construct a graph by recording a queue,
nodes and edges are captured as follows.

In the explicit graph building API, `make_edge()` is used to define the dependency
rather than inferring them from data dependencies.
Table 4. Recorded Graph Definition.
[%header,cols="1,3"]
|===
| Concept | Description

| Node
| Nodes in a queue recorded graph represent each of the command group
submissions of the program. Each submission encompasses either one or both of
a.) some data movement, b.) a single asynchronous kernel launch. Nodes cannot
define forward edges, only backwards (i.e. kernels can only create dependencies
on things that have already happened). This means that transparently a node can
depend on a previously recorded graph (sub-graph), which works by creating edges
to the individual nodes in the old graph. Explicit memory operations without
kernels, such as a memory copy, are still classed as nodes under this
definition, as the {explicit-memory-ops}[SYCL 2020 specification states] that
these can be seen as specialized kernels executing on the device.

| Edge
| An edge in a queue recorded graph represents a data dependency between two
nodes. These dependencies are expressed by the user code through buffer
accessors. There is also the partial ability to track USM data dependencies
provided the pointers used in the graph nodes are the same. With the limitation
that a node taking an offset USM pointer input will not be identified as having
an edge to another node taking a pointer input to the base address of the same
USM allocation.
|===
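
The USM limitation described in Table 4 can be sketched with a model that infers edges purely from pointer equality. This is an assumption about how such tracking might behave, not the implementation; `toy_recorder`, `record()`, and `last_user` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical recorder that infers dependencies from identical pointer values.
struct toy_recorder {
  std::map<const void *, std::size_t> last_user;           // pointer -> last node using it
  std::vector<std::pair<std::size_t, std::size_t>> edges;  // (earlier node, later node)
  std::size_t node_count = 0;

  // Records a node that uses `usm_ptr`; an edge is added only if an earlier
  // node used exactly the same pointer value, so edges always point backwards.
  std::size_t record(const void *usm_ptr) {
    assert(usm_ptr != nullptr);
    std::size_t id = node_count++;
    auto it = last_user.find(usm_ptr);
    if (it != last_user.end())
      edges.emplace_back(it->second, id);
    last_user[usm_ptr] = id;
    return id;
  }
};
```

Under this model, two nodes using the same base pointer are linked, while a third node using an offset into the same allocation gets no edge, which is the gap the table warns about.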

=== API Modifications
@@ -316,7 +345,8 @@ Parameters:

Exceptions:

* TODO - Throw if this introduces a cycle?
* Throws synchronously with error code `invalid` if a queue is recording
commands to any graph associated with `sender` or `receiver`.

=== Graph

@@ -371,7 +401,7 @@ create the executable graphs, with the nodes added in the same order.

==== Graph Member Functions

Table 3. Constructor of the `command_graph` class.
Table 5. Constructor of the `command_graph` class.
[cols="2a,a"]
|===
|Constructor|Description
@@ -397,7 +427,7 @@ Parameters:

|===

Table 4. Member functions of the `command_graph` class.
Table 6. Member functions of the `command_graph` class.
[cols="2a,a"]
|===
|Member function|Description
@@ -418,6 +448,11 @@ Parameters:

Returns: The empty node which has been added to the graph.

Exceptions:

* Throws synchronously with error code `invalid` if a queue is recording
commands to the graph.

|
[source,c++]
----
@@ -437,6 +472,11 @@ Parameters:

Returns: The command-group function object node which has been added to the graph.

Exceptions:

* Throws synchronously with error code `invalid` if a queue is recording
commands to the graph.

|
[source,c++]
----
@@ -467,7 +507,7 @@ Memory that is allocated by the following functions is owned by the specific
graph. When freed inside the graph, the memory is only accessible before the
`free` node is executed and after the `malloc` node is executed.

Table 5. Member functions of the `command_graph` class (memory operations).
Table 7. Member functions of the `command_graph` class (memory operations).
[cols="2a,a"]
|===
|Member function|Description
@@ -489,6 +529,11 @@ Parameters:

Returns: The memory allocation node which has been added to the graph

Exceptions:

* Throws synchronously with error code `invalid` if a queue is recording
commands to the graph.

|
[source,c++]
----
@@ -506,13 +551,12 @@ Returns: The memory freeing node which has been added to the graph.

Exceptions:

* TODO - Throw if not allocated by `add_malloc_device`?
* TODO - Throw if already freed?
* TODO - Throw if not valid address?
* Throws synchronously with error code `invalid` if a queue is recording
commands to the graph.

|===

Table 6. Member functions of the `command_graph` class (executable graph update).
Table 8. Member functions of the `command_graph` class (executable graph update).
[cols="2a,a"]
|===
|Member function|Description
@@ -580,7 +624,7 @@ The state of a queue can be queried with `queue::get_info` using template
parameter `info::queue::state`. The following entry is added to the
{queue-info-table}[queue info table] to define this query:

Table 7. Queue info query
Table 9. Queue info query
[cols="2a,a,a"]
|===
| Queue Descriptors | Return Type | Description
@@ -730,9 +774,6 @@ there would be no thread safe way for a user to check they could call these
functions without throwing, as a query about the state of the queue may be
immediately stale.

* TODO - error on add_node while being recorded to a queue? or queue recording a
graph with explicitly build nodes?

=== Storage Lifetimes

The lifetime of any buffer recorded as part of a submission