Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL][Graph] Update design doc for copy queue #362

Open
wants to merge 37 commits into
base: sycl
Choose a base branch
from

Conversation

mfrancepillois
Copy link
Collaborator

Update the design doc as command-buffer now uses the copy queue if a copy engine is available.

@mfrancepillois mfrancepillois added the Graph Implementation Related to DPC++ implementation and testing label Mar 15, 2024
@Bensuo Bensuo force-pushed the maxime/update-doc-copy-queue branch 2 times, most recently from 7739b83 to d5773eb Compare May 28, 2024 14:49
Copy link
Collaborator

@EwanC EwanC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpicked the design doc but LGTM

sycl/doc/design/CommandGraph.md Outdated Show resolved Hide resolved
sycl/doc/design/CommandGraph.md Outdated Show resolved Hide resolved
sycl/doc/design/CommandGraph.md Outdated Show resolved Hide resolved
sycl/doc/design/CommandGraph.md Outdated Show resolved Hide resolved
sycl/doc/design/CommandGraph.md Outdated Show resolved Hide resolved
Copy link
Collaborator

@EwanC EwanC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add an E2E test for a linear graph that interleaves memory and kernel commands. I'd like to verify that the interaction between our in-order command-list optimization and compute/copy command-lists works correctly.

@Bensuo
Copy link
Collaborator

Bensuo commented Jun 3, 2024

Can we add an E2E test for a linear graph that interleaves memory and kernel commands. I'd like to verify that the interaction between our in-order command-list optimization and compute/copy command-lists works correctly.

I have added a test locally but actually the way it is right now the copy command list optimization will only happen for out-of-order command buffers. It might make sense to leave that as it is for now and work on combining these two optimizations if worth it as a further piece of work?

@Bensuo Bensuo force-pushed the maxime/update-doc-copy-queue branch from 73ba45c to 4a47e54 Compare June 4, 2024 16:21
@Bensuo Bensuo force-pushed the maxime/update-doc-copy-queue branch 2 times, most recently from eb67857 to 020a5ff Compare June 11, 2024 16:41
mdtoguchi and others added 14 commits June 11, 2024 18:43
…4143)

When passing along multiple targets in the form of
-fsycl-targets=intel_gpu_dg1,intel_gpu_pvc, the number of the device
compilations was n*n as opposed to just n. Due to how we were handling
duplicate entries for toolchain generation, the different names used
even though they had the same target triple (spir64_gen) we being
considered as unique, causing the multiple entries.

This is the second attempt to push this one in, updated the
sycl-offload-new-driver.c test to reflect ordering issues encountered.
…l targets (intel#14102)

clang-linker-wrapper is not target-specific. i.e. it is not called for a
single target device. It is called only once.
Currently, clang-linker-wrapper is called only with device images with
spir64 targets. So, the existing approach to capture the first target
triple in the list of triples and use it for gathering
sycl-device-library files is valid. As we plan to add support for more
targets (AOT), we need to gather sycl-device-libraries for all targets.
This PR addresses this change.

Also, the triple should not be passed to the linker wrapper. The linker
wrapper should get the triples from device images.

Thanks

---------

Signed-off-by: Arvind Sudarsanam <[email protected]>
…ntel#14138)

Required for specific use-cases in SYCLomatic.

---------

Signed-off-by: Alberto Cabrera <[email protected]>
The `Graph/UnsupportedDevice/device_query.cpp` test asserts that L0
devices will never have full graph support. This is not the case,
depending on the L0 device and driver version full graphs support is
possible.

Update the test to remove asserting on this, as diving into these
details is out of the scope of the test. This was previously decided
when discussion how to check the OpenCL backend for similar possible
variances in aspect support.
`cuda_dev_kit` is not set properly in
[test-e2e/lit.cfg.py](https://github.com/intel/llvm/blob/sycl/sycl/test-e2e/lit.cfg.py)
due to invalid CUDA paths.
Fixing the paths showed errors in
[14115](intel#14115) and
[14116](intel#14116) which are XFAILed.
The patch fixes the failure of
[cuda_queue_priority.cpp](https://github.com/intel/llvm/blob/sycl/sycl/test-e2e/Plugin/cuda_queue_priority.cpp)
on Windows / CUDA.
This PR adds math `extend_v*4` operators (18 in total) along with
unit-tests for signed and unsigned int32 cases.
*Some changes overlap with the previous `extend_v*2` PR intel#13953 and thus
should be reviewed/merged first.

---------

Co-authored-by: Alberto Cabrera Pérez <[email protected]>
Co-authored-by: Joe Todd <[email protected]>
Co-authored-by: Yihan Wang <[email protected]>
…l#13807)

It was propagated to `getOrCreateAllocaForReq` when creating a new
record, but then no commands are expected to be enqueued there since the
first alloca for a record cannot exceed its leaf limit or be linked to
another alloca.
Running the test on Windows failed due to missing support of `ls`.
Replacing `ls` with `cat` made the test pass on Windows.
…l#14120)

- `InorderQueue/in_order_get_property.cpp` -> Use non-deprecated
`sycl::exception`, add check for errc to ensure we are still catching
the correct exception
- `InorderQueue/in_order_kernels.cpp` -> Use group `get_group_id`
function instead of deprecated `get_id`
- `InorderQueue/in_order_usm_implicit.cpp` -> Use queue `mem_advice`
function that uses `int` instead of `pi_mem_advice`
againull and others added 19 commits June 12, 2024 09:30
Currently Level Zero plugin uses loader and headers fetched by the Level
Zero adapter (LevelZeroLoader-Headers, LevelZeroLoader targets).
Currently downloaded loader code is not used, only headers are used for
xpti.
So, get headers location from LevelZeroLoader-Headers target instead and
remove unnecessary code.
Add group_key_value_sorter sorters and sort_key_value_over_group APIs
based on
https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/proposed/sycl_ext_oneapi_group_sort.asciidoc
extension.

This PR was split out from larger PR:
intel#13713

Co-authored-by: "Andrei Fedorov
[[email protected]](mailto:[email protected])"
Co-authored-by: "Romanov Vlad
[[email protected]](mailto:[email protected])"
…ntel#14129)

A single basic file to compile and run to test functionality of
--offload-new-driver

---------

Co-authored-by: Marcos Maronas <[email protected]>
…d ConvertFToBFloat16INTEL (intel#14085)

This PR adds vector overloads of `ConvertBFloat16ToFINTEL` and
`ConvertFToBFloat16INTEL` to libdevice (SPEC:
https://spec.oneapi.io/level-zero/latest/core/SPIRV.html#bfloat16-conversions)
and a wrapper around it (`BF16VecToFloatVec` and `FloatVecToBF16Vec`) in
`ext::oneapi::detail`.

These overloads are intended to optimize BFloat16 `marray`, `vec`
operations, for which we currently do element-by-element `bfloat16 ->
float -> bfloat16` conversions.
…4130)

Replaces intel#13270

Changing the storage to std::array instead of Clang's extension fixes
strict ansi-aliasing violation and simplifies device code.
[SYCL] Adding support for missing math ops:
- truncf
- sinpif
- rsqrtf
- exp10f
Add section to the contribution guide detailing the current process for
integrating Unified Runtime updates into DPC++.
…ntel#14123)

Current implementation of profiling info for NOP barriers is
inconsistent
with other events from the same queue (e.g., if the previous event
started
after the barrier was submitted). To make them consistent while keeping
the optimization, we would need to duplicate the event on our side and
make the duplicate check and potentially use profiling info of its
previous event.

Instead, as the first step, disable the NOP optimization during
profiling
since profiling is known to incur a performance hit anyway. The proper
duplicate event approach can be implemented as a follow up if this
causes issues for users.

Partially reverts intel#12949
atomic_ref<T *> uses 64-bit atomics and it should be decorated with the
corresponding aspect.

fixes: intel#12743
I'm observing cache overflow when running heavy tests on OCL backend
with gpu. Clear cache in case of PI_ERROR_OUT_OF_HOST_MEMORY as well as
for PI_ERROR_OUT_OF_RESOURCES.
Using as reference: intel#11987
Scheduled igc dev drivers uplift

Co-authored-by: GitHub Actions <[email protected]>
The function is using the `operator=` before it's defined which can
cause some build failures:

```
build/include/sycl/ext/oneapi/bfloat16.hpp:98:19: error: no match for ‘operator=’ (operand types are ‘sycl::_V1::ext::oneapi::bfloat16’ and ‘float’)
   98 |     dst[i] = src[i];
      |                   ^
```

Moving it after the bfloat16 class definition fixes it.
Bumps [braces](https://github.com/micromatch/braces) from 3.0.2 to
3.0.3.
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/micromatch/braces/commit/74b2db2938fad48a2ea54a9c8bf27a37a62c350d"><code>74b2db2</code></a>
3.0.3</li>
<li><a
href="https://github.com/micromatch/braces/commit/88f1429a0f47e1dd3813de35211fc97ffda27f9e"><code>88f1429</code></a>
update eslint. lint, fix unit tests.</li>
<li><a
href="https://github.com/micromatch/braces/commit/415d660c3002d1ab7e63dbf490c9851da80596ff"><code>415d660</code></a>
Snyk js braces 6838727 (<a
href="https://redirect.github.com/micromatch/braces/issues/40">#40</a>)</li>
<li><a
href="https://github.com/micromatch/braces/commit/190510f79db1adf21d92798b0bb6fccc1f72c9d6"><code>190510f</code></a>
fix tests, skip 1 test in test/braces.expand</li>
<li><a
href="https://github.com/micromatch/braces/commit/716eb9f12d820b145a831ad678618731927e8856"><code>716eb9f</code></a>
readme bump</li>
<li><a
href="https://github.com/micromatch/braces/commit/a5851e57f45c3431a94d83fc565754bc10f5bbc3"><code>a5851e5</code></a>
Merge pull request <a
href="https://redirect.github.com/micromatch/braces/issues/37">#37</a>
from coderaiser/fix/vulnerability</li>
<li><a
href="https://github.com/micromatch/braces/commit/2092bd1fb108d2c59bd62e243b70ad98db961538"><code>2092bd1</code></a>
feature: braces: add maxSymbols (<a
href="https://github.com/micromatch/braces/issues/">https://github.com/micromatch/braces/issues/</a>...</li>
<li><a
href="https://github.com/micromatch/braces/commit/9f5b4cf47329351bcb64287223ffb6ecc9a5e6d3"><code>9f5b4cf</code></a>
fix: vulnerability (<a
href="https://security.snyk.io/vuln/SNYK-JS-BRACES-6838727">https://security.snyk.io/vuln/SNYK-JS-BRACES-6838727</a>)</li>
<li><a
href="https://github.com/micromatch/braces/commit/98414f9f1fabe021736e26836d8306d5de747e0d"><code>98414f9</code></a>
remove funding file</li>
<li><a
href="https://github.com/micromatch/braces/commit/665ab5d561c017a38ba7aafd92cc6655b91d8c14"><code>665ab5d</code></a>
update keepEscaping doc (<a
href="https://redirect.github.com/micromatch/braces/issues/27">#27</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/micromatch/braces/compare/3.0.2...3.0.3">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=braces&package-manager=npm_and_yarn&previous-version=3.0.2&new-version=3.0.3)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the
[Security Alerts page](https://github.com/intel/llvm/network/alerts).

</details>

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…tly (intel#12872)

This PR refactors the builtin fence helper macro for AMDGPU to take in
and process the order semantic explicitly because that is the only
semantic argument accepted by the amdgcn builtin.

Additionally, makes the `None` (Monotonic) order semantic which maps to
C++/SYCL's `relaxed` to be a no-op instead of falling back to the
previous `acq_rel` default order.

---------

Co-authored-by: Kenneth Benzie (Benie) <[email protected]>
The rotate functions are technically c++20 and MSVC hasn't implemented
them yet.

Signed-off-by: Sarnie, Nick <[email protected]>
…ntel#14151)

One of the models that is used for specifying the device architecture
for spir64_gen is to use the -Xsycl-target-backend "-device arg" syntax
on the command line.

Hook up the ability to scan the target backend values to embed the
proper information in the packaged binary when using the new offload
model.
This PR adds math `extend_vcompare[2/4] `operators (4 in total) along
with unit-tests for signed and unsigned int32 cases.
Also, Unit-tests from previous `extend_v*4` intel#14078 and `extend_v*2`
intel#13953 are moved to two different files.

---------

Co-authored-by: Alberto Cabrera Pérez <[email protected]>
Co-authored-by: Joe Todd <[email protected]>
Co-authored-by: Yihan Wang <[email protected]>
@EwanC EwanC force-pushed the maxime/update-doc-copy-queue branch from 020a5ff to 0d13b58 Compare June 14, 2024 07:43
nrspruit and others added 4 commits June 14, 2024 14:45
…ntel#14150)

pre-commit PR for
oneapi-src/unified-runtime#1749

---------

Signed-off-by: Neil R. Spruit <[email protected]>
Co-authored-by: Kenneth Benzie (Benie) <[email protected]>
intel#14162)

Instead of using old device selector objects, use SYCL 2020 device
selector callables to construct devices in `FilterSelector` e2e tests.
- Update UR tag to include L0 command-buffer copy engine optimization
- Add test which mixes copy and kernel commands
- Update design doc to detail copy engine optimization
@EwanC EwanC force-pushed the maxime/update-doc-copy-queue branch from 5d44bdf to 01b1582 Compare June 14, 2024 13:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Graph Implementation Related to DPC++ implementation and testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.