
refactor[cartesian]: unexpanded sdfg cleanups #1843

Closed
wants to merge 7 commits

Conversation

romanc
Contributor

@romanc romanc commented Feb 1, 2025

Description

Refactors from recent debugging sessions around transient arrays in the "unexpanded SDFG" (the one with the library nodes):

  • Remove unused **kwargs from OirSDFGBuilder
  • Forward debugging information about transient arrays to DaCe
  • Use a (constant) variable for connector prefixes of data going into/out of the library nodes
  • Configure the lifetime of transient arrays directly in OirSDFGBuilder
  • Configure storage type of transient arrays directly in OirSDFGBuilder (for GPU targets)
  • Configure the LibraryNode's device type directly in OirSDFGBuilder (for GPU targets)
  • Enum type for StorageDevice in LayoutInfo
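
Taken together, the builder-side configuration described above might look roughly like the following sketch. This is illustrative only, not the actual gt4py code: `transient_settings` is a hypothetical helper, and the storage/lifetime strings stand in for the real `dace.dtypes.StorageType` and `dace.dtypes.AllocationLifetime` values.

```python
from enum import Enum


class StorageDevice(Enum):
    CPU = 1
    GPU = 2


def transient_settings(device: StorageDevice) -> dict:
    """Pick storage and lifetime for a transient array at build time,
    instead of patching descriptors in separate passes afterwards."""
    return {
        "transient": True,
        # GPU transients need global device memory; CPU transients keep the default.
        "storage": "GPU_Global" if device is StorageDevice.GPU else "Default",
        # Persistent lifetime avoids re-allocation when a kernel runs repeatedly.
        "lifetime": "Persistent",
    }


print(transient_settings(StorageDevice.GPU)["storage"])  # GPU_Global
```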

Sidenote on the allocation lifetime: in the orchestrated code path, we reset the allocation lifetime of transients to `SDFG` when we freeze the stencil with origin/domain:

```python
def freeze_origin_domain_sdfg(inner_sdfg, arg_names, field_info, *, origin, domain):
    wrapper_sdfg = dace.SDFG("frozen_" + inner_sdfg.name)
    state = wrapper_sdfg.add_state("frozen_" + inner_sdfg.name + "_state")
    inputs = set()
    outputs = set()
    for inner_state in inner_sdfg.nodes():
        for node in inner_state.nodes():
            if (
                not isinstance(node, dace.nodes.AccessNode)
                or inner_sdfg.arrays[node.data].transient
            ):
                continue
            if node.has_reads(inner_state):
                inputs.add(node.data)
            if node.has_writes(inner_state):
                outputs.add(node.data)
    nsdfg = state.add_nested_sdfg(inner_sdfg, None, inputs, outputs)
    _sdfg_add_arrays_and_edges(
        field_info, wrapper_sdfg, state, inner_sdfg, nsdfg, inputs, outputs, origins=origin
    )
    # in special case of empty domain, remove entire SDFG.
    if any(d == 0 for d in domain):
        states = wrapper_sdfg.states()
        assert len(states) == 1
        for node in states[0].nodes():
            state.remove_node(node)
    # make sure that symbols are passed through to inner sdfg
    for symbol in nsdfg.sdfg.free_symbols:
        if symbol not in wrapper_sdfg.symbols:
            wrapper_sdfg.add_symbol(symbol, nsdfg.sdfg.symbols[symbol])
    # Try to inline wrapped SDFG before symbols are specialized to avoid extra views
    inline_sdfgs(wrapper_sdfg)
    _sdfg_specialize_symbols(wrapper_sdfg, domain)
    for _, _, array in wrapper_sdfg.arrays_recursive():
        if array.transient:
            array.lifetime = dace.dtypes.AllocationLifetime.SDFG
    wrapper_sdfg.arg_names = arg_names
    return wrapper_sdfg
```

This might be relevant when tracking down orchestration performance. Seems odd at least.
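
The interplay described in the sidenote can be reduced to a toy model (plain-Python stand-ins, not dace objects): whatever lifetime the builder configures, the final loop of `freeze_origin_domain_sdfg` overwrites it with `SDFG` for every transient.

```python
from dataclasses import dataclass


@dataclass
class Array:
    transient: bool
    lifetime: str = "Persistent"  # as configured at build time


def freeze_lifetimes(arrays):
    # Mirrors the last loop of freeze_origin_domain_sdfg: every transient
    # is reset to SDFG lifetime, regardless of what was set before.
    for array in arrays:
        if array.transient:
            array.lifetime = "SDFG"
    return arrays


arrays = freeze_lifetimes([Array(transient=True), Array(transient=False)])
print([a.lifetime for a in arrays])  # ['SDFG', 'Persistent']
```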

Requirements

  • All fixes and/or new features come with corresponding tests.
    Covered by existing tests
  • Important design decisions have been documented in the appropriate ADR inside the docs/development/ADRs/ folder.
    N/A

@romanc romanc force-pushed the romanc/unexpanded-sdfg-cleanups branch 2 times, most recently from 76c47ad to c7aba86 Compare February 3, 2025 12:52
@romanc romanc marked this pull request as ready for review February 3, 2025 18:18
@romanc romanc force-pushed the romanc/unexpanded-sdfg-cleanups branch from 553196d to 38df291 Compare February 10, 2025 08:09
@romanc
Contributor Author

romanc commented Feb 10, 2025

(Rebased onto the main branch because there was a hiccup on the CSCS-CI.)

@FlorianDeconinck
Contributor

Re: freeze_origin_domain_sdfg

A cursory look suggests this is only called at the top level for DaceLazyStencil et al., i.e. it's not called when we are doing orchestration but only for the top-level arguments (and then it's fine).

We should log a task to double-check, and we need a methodological/code way to differentiate which code paths we expect to run in stencil mode and which in orchestrated mode.

@FlorianDeconinck FlorianDeconinck left a comment

Looking good. Couple of questions.

Do the enum for the device type; the string is making me sad.

@romanc romanc force-pushed the romanc/unexpanded-sdfg-cleanups branch from 38df291 to 6ea0fe4 Compare February 11, 2025 09:46
@FlorianDeconinck FlorianDeconinck left a comment

LGTM

@FlorianDeconinck
Contributor

@havogt Care to review or ok to push?

@havogt havogt left a comment

I would prefer to use the DeviceType enum from _core.definitions instead of introducing a new enum. Can you check if that works?

Comment on lines 33 to 35:

```python
class StorageDevice(Enum):
    CPU = 1
    GPU = 2
```
Contributor

It would be nice to use the DeviceType from _core.definitions. I don't have time to check how ROCM would fit here.
Maybe we can do

```python
CurrentGPU = core_defs.DeviceType.ROCM if cp.cuda.runtime.is_hip else core_defs.DeviceType.CUDA
```

somewhere. cupy currently has a limitation that it is either installed for cuda or for rocm, so it would be fine.
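
The suggestion could be sketched as follows. This is illustrative only: the `DeviceType` enum below mirrors the one in `_core.definitions` (member values assumed), and the `is_hip` flag is passed in explicitly where real code would read `cp.cuda.runtime.is_hip` from cupy.

```python
from enum import Enum


class DeviceType(Enum):
    CPU = 0
    CUDA = 1
    ROCM = 2


def current_gpu(is_hip: bool) -> DeviceType:
    # cupy is installed either for CUDA or for ROCm, never both, so a
    # single flag is enough to pick the concrete GPU device type.
    return DeviceType.ROCM if is_hip else DeviceType.CUDA
```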

Contributor Author

Sure, let's reuse DeviceType from the core definitions. I honestly don't know what our support level for AMD graphics cards is. From what I read in the code, it looks like that's going to be an item on the task list at some point, so I personally wouldn't worry too much about ROCM for now.

Contributor

ECMWF runs AMD hardware, so we need to keep this alive.

Contributor Author

For me to understand: which backend is compatible with AMD hardware? From what I read, the CudaBackend as well as the DaceGPUBackend generate cuda code.

Contributor

So DaCe can do HIP; I believe that's where they get their AMD goodies. @stubbiali, can you talk to the AMD hardware use? Am I misremembering?

Contributor

Both GridTools and DaCe GPU backends support AMD graphics cards. We have been running GT4Py codes on LUMI for at least 1.5 years.

Roman Cattaneo and others added 6 commits February 12, 2025 15:45
Forward debug info from gt4py to dace if we know it. If we don't know, just don't specify it instead of setting `DebugInfo(0)`.

Moved `get_dace_debuginfo()` one folder higher from expansion utils into "general" dace utils because it's not only used in expansion.

Transients are added in `OirSDFGBuilder`, where no array lifetime is configured. After building that SDFG, the lifetime of all transients is manually set to `Persistent` (an optimization leading to less frequent memory allocation in case a kernel is called multiple times). In this commit, we directly specify the transient's lifetime when building the SDFG.

For GPU targets, we have to configure the `storage_type` for transient arrays. In addition, we have to set the library node's `device` property. We can do both while building the SDFG instead of in separate passes afterwards.
@romanc romanc marked this pull request as draft February 12, 2025 16:59
@romanc
Contributor Author

romanc commented Feb 12, 2025

Putting this back to draft mode.

  • Re-using DeviceType from _core.definitions is possible and I think it's worth doing.
  • It's a bit of work, and I might need to re-shuffle some things because tach complains that gt4py.cartesian should not depend on gt4py._core directly. I guess everything related to cpu/gpu device(s) should go through the storage module. In the end, there are many checks like "if this is a gpu device, then do this ...", and they should be centralized (in the gt_storage package or wherever) so that we can deal with CUDA / ROCM in a single place.
  • Mixing the original changes in this PR with changing the "cpu" / "gpu" strings to re-use DeviceType is going to look ugly in the git history (especially when squashed).
  • I'll push a WIP state here and then probably separate the two things into separate (consecutive) PRs tomorrow.

To be split from the other changes, probably in a follow-up PR.
@romanc
Contributor Author

romanc commented Feb 14, 2025

Closing as superseded by #1860 and #1867.

@romanc romanc closed this Feb 14, 2025
@romanc romanc deleted the romanc/unexpanded-sdfg-cleanups branch February 14, 2025 19:44
romanc added a commit that referenced this pull request Feb 17, 2025
## Description

Refactors from (recent) debugging sessions around transient arrays in the "unexpanded SDFG" (the one with the library nodes):

- Remove unused `**kwargs` from `OirSDFGBuilder`
- Forward debugging information about transient arrays to DaCe
- Use a (constant) variable for connector prefixes of data going into/out of the library nodes
- Configure the lifetime of transient arrays directly in `OirSDFGBuilder`

This is a follow-up from PR
#1843. In this PR, we separate the DaCe backend cleanup from the refactor around (re-)using `DeviceType` instead of `"cpu" | "gpu"` string literals.

## Requirements

- [x] All fixes and/or new features come with corresponding tests.
  Should be covered by existing tests.
- [ ] Important design decisions have been documented in the appropriate
ADR inside the [docs/development/ADRs/](docs/development/ADRs/Index.md)
folder.
  N/A

---------

Co-authored-by: Roman Cattaneo <>
Co-authored-by: Roman Cattaneo <[email protected]>
Co-authored-by: Florian Deconinck <[email protected]>