Deferred Allocation #1704

ThrudPrimrose · 2024-10-24T09:16:45Z

This is possibly the start of a long PR that will evolve, and we should discuss it at every step.

The main idea is as follows (so far in the prototype):

Idea 1:
When an array, say "A", is created, a size array "A_size" is created alongside it. The size array is always one-dimensional and stored contiguously. When there is an edge A_size -> A (the IN connector needs to be _size, i.e. IN_size in the example below), we trigger a reallocation operation and not a copy operation. I am not sure whether we should have an A_size array at all; I would like to discuss this.

I have a small SDFG that shows what it looks like.

import dace

def one():
    sdfg = dace.sdfg.SDFG(name="deferred_alloc_test")

    sdfg.add_array(name="A", shape=("__dace_defer", "__dace_defer"), dtype=dace.float32, storage=dace.dtypes.StorageType.Default, transient=True)

    state = sdfg.add_state("main")

    an_1 = state.add_access('A')
    an_1.add_in_connector('IN_size')

    an_2 = state.add_scalar(name="dim0", dtype=dace.uint64)
    an_3 = state.add_scalar(name="dim1", dtype=dace.uint64)

    s_an_1 = dace.nodes.AccessNode(data="A_size")
    state.add_node(s_an_1)
    s_an_2 = dace.nodes.AccessNode(data="A_size")
    state.add_node(s_an_2)
    state.add_edge(s_an_1, None, s_an_2, None,
                dace.Memlet(None) )

    state.add_edge(s_an_2, None, an_1, 'IN_size',
                dace.Memlet(expr="A_size[0:2]") )

    state.add_edge(an_2, None, s_an_1, None,
                dace.Memlet(expr="A_size[0]") )
    state.add_edge(an_3, None, s_an_2, None,
                dace.Memlet(expr="A_size[1]") )


    sdfg.save("def_alloc_1.sdfg")
    sdfg.validate()

    return sdfg

s = one()
s.generate_code()
s(dim0=512, dim1=512)

The generated code looks as follows: [image not included]

And the SDFG: [image not included]

Next Step:
Not every entry of the size array will always be used. For example, when calling realloc, only the "__dace_defer" dimensions are read from the array; the dimensions that are passed using symbols are read from the respective symbols (these cannot change throughout the SDFG). I am currently working on this step.

tbennun · 2024-10-24T15:26:12Z

I would not use IN_size (i.e., a throughput connector), maybe use just size.

ThrudPrimrose · 2024-10-24T16:53:01Z

> I would not use IN_size (i.e., a throughput connector), maybe use just size.

About the connector, we can also read the size, then this will be "OUT_size" I had planned. Would still suggest both connectors to be named size?

tbennun · 2024-10-24T22:47:52Z

> > I would not use IN_size (i.e., a throughput connector), maybe use just size.
>
> About the connector, we can also read the size, then this will be "OUT_size" I had planned. Would still suggest both connectors to be named size?

yes, exactly. Since those connectors are endpoints, and not flowing through the access node (as they would in a scope node).

ThrudPrimrose · 2024-10-29T12:59:09Z

> > > I would not use IN_size (i.e., a throughput connector), maybe use just size.
> >
> > About the connector, we can also read the size, then this will be "OUT_size" I had planned. Would still suggest both connectors to be named size?
>
> yes, exactly. Since those connectors are endpoints, and not flowing through the access node (as they would in a scope node).

I was updating the implementation according to a discussion we had with Torsten and Lex. I also changed the implementation to use "size" for both in and out connectors. Now the problem is that this SDFG does not validate because the access node has duplicate connectors. I think it is better to make the size connectors distinct rather than changing the validation rules. Or should I update the validation procedures with access nodes being an exception?

ThrudPrimrose · 2024-12-12T10:01:28Z

I want to share the design document.

With the proposal, we support dynamic allocation and reallocation of GPU_Global and CPU_Heap arrays.
The only DaCe data type that supports reallocation is dace.data.Array.

For CPU_Heap arrays, reallocation is performed through a call to realloc; for GPU_Global storage, it is implemented as a sequence of malloc, copy, and free, as CUDA does not support realloc.
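The malloc, copy, free sequence can be illustrated with a small host-side model. This is only a sketch in plain Python, not the generated CUDA code: `realloc_via_malloc_copy_free` is a hypothetical name, and the choice to preserve the overlapping prefix of the old contents is an assumption that mirrors what realloc does on the CPU.

```python
def realloc_via_malloc_copy_free(old_buf: bytearray, new_size: int) -> bytearray:
    """Model a realloc on storage that only offers malloc, copy and free."""
    # "malloc": allocate the new buffer (zero-filled here for determinism).
    new_buf = bytearray(new_size)
    # "copy": keep the overlapping prefix of the old contents (assumption).
    n = min(len(old_buf), new_size)
    new_buf[:n] = old_buf[:n]
    # "free": in real device code the old buffer would be released here;
    # in Python the caller simply drops its reference to old_buf.
    return new_buf


buf = bytearray(b"abcd")
grown = realloc_via_malloc_copy_free(buf, 6)   # larger buffer, old data kept
shrunk = realloc_via_malloc_copy_free(buf, 2)  # smaller buffer, data truncated
```

Unlike realloc, this sequence always needs both buffers to be live at the same time during the copy.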

Reallocation is only allowed in host-side code and can only be triggered when the scope is None. (For example, reallocation inside a map or a nested SDFG is not possible: realloc / malloc are usually not thread-safe, and a thread-safe implementation would not perform well.)

A deferred array is generated only upon the request of the user. This is done by including the symbol __dace_defer in any of the expressions in the shape of the array when it is added. It is invalid to reshape an array that is deferred. If an expression contains multiple occurrences of the __dace_defer symbol, they are assumed to be the same symbol.
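As an illustration of how a shape could be classified, here is a minimal sketch that works on shape expressions given as strings. The helper names `deferred_dims` and `is_deferred` are hypothetical, and a regex stands in for the symbolic inspection a real implementation would more likely perform.

```python
import re

# Matches the bare symbol only; the trailing \b keeps names such as
# "__dace_defer_dim0" from being counted as the symbol itself.
_DEFER_RE = re.compile(r"\b__dace_defer\b")


def deferred_dims(shape):
    """Return the indices of the dimensions whose expression uses __dace_defer."""
    return [i for i, expr in enumerate(shape) if _DEFER_RE.search(str(expr))]


def is_deferred(shape):
    """An array is deferred if any dimension of its shape uses __dace_defer."""
    return bool(deferred_dims(shape))
```

For example, `deferred_dims(("2*N", "4*__dace_defer"))` reports only the second dimension, while a shape built purely from ordinary symbols is not deferred.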

The array size is tracked by a unique size array connected to a deferred array. It is stored in the _arrays dictionary of the SDFG, just like other arrays. The constructor of dace.data.Array sets is_size_array to true for size arrays and is_deferred_array to true for deferred arrays. These members are set only for arrays, not for other DaCe data types. The size array of an array is tracked through the size_desc_name variable of the array. The name of the size array is created by appending the _size suffix to the array's name.

The size array is always one-dimensional. Its length matches the number of dimensions (the length of the shape) of the deferred array. No size array is created if the array is not deferred.
The dimensions that do not contain __dace_defer are copied into the size array as its initial values. If an array has the shape A[2*N, 4*__dace_defer], then the size array is initialized as A_size[2]{2*N, 0}. The dimensions that do not contain __dace_defer are not accessed by the codegen, yet they are written to the size array to allow the user to access them. When an offset expression is computed
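The initialization rule can be sketched as follows, again on string expressions. The function name `initial_size_array` is hypothetical; the only behavior taken from the description is that non-deferred dimensions are copied and deferred dimensions start at 0.

```python
import re

_DEFER_RE = re.compile(r"\b__dace_defer\b")


def initial_size_array(shape):
    """Build the initial contents of the size array for a deferred array.

    Dimensions without __dace_defer are copied verbatim; deferred
    dimensions start at 0 until the first reallocation writes a size.
    """
    return ["0" if _DEFER_RE.search(str(expr)) else str(expr) for expr in shape]


# Shape A[2*N, 4*__dace_defer] gives A_size[2]{2*N, 0}.
a_size = initial_size_array(("2*N", "4*__dace_defer"))
```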

Reallocation is triggered by writing to the special _write_size in connector of an access node. As part of reallocation, the user provides only the new value of __dace_defer for each deferred dimension, and this value is what gets written to the size array. If the new size written by the user for dimension 1 is 5, then for the array A above the latest value of __dace_defer becomes 5 and A_size[1] = 5, while the actual dimensions of A are [2*N][4*5]. Existing functions from the cpp module still calculate the dimensions using the shape member of the array. To support the __dace_defer symbol and deferred allocation, some functions accept a parameter called deferred_size_names or generate it automatically by detecting __dace_defer symbols in the shape. Offset expressions are matched against the __dace_defer_dim(\d+) pattern, and the matches are replaced with accesses to the size array in host code. On GPU kernels, the pattern matching is slightly different.
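The host-side substitution can be sketched with a regular expression. `rewrite_offset_expr` is a hypothetical helper, and the sample offset expression is made up for illustration; only the `__dace_defer_dim(\d+)` pattern and the replacement by a size array access come from the description above.

```python
import re

_DEFER_DIM_RE = re.compile(r"__dace_defer_dim(\d+)")


def rewrite_offset_expr(expr: str, size_array_name: str) -> str:
    """Replace every __dace_defer_dim<k> with <size_array_name>[<k>]."""
    return _DEFER_DIM_RE.sub(
        lambda m: f"{size_array_name}[{m.group(1)}]", expr
    )


# Row-major offset into A[2*N, 4*__dace_defer], written with the per-dimension
# placeholder for the deferred symbol of dimension 1 (illustrative input).
host_expr = rewrite_offset_expr("i*(4*__dace_defer_dim1) + j", "A_size")
```

With A_size[1] = 5 the rewritten expression uses a row length of 4*5 = 20 elements, matching the [2*N][4*5] dimensions above.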

The length of the write to the _write_size in connector always needs to match the length of the array's shape. The values for dimensions that do not have __dace_defer in their expressions are ignored.
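A minimal model of this rule, assuming string shape expressions and a Python list for the size array. `apply_size_write` is a hypothetical name, and raising ValueError on a length mismatch is an assumption about how such an error would surface.

```python
import re

_DEFER_RE = re.compile(r"\b__dace_defer\b")


def apply_size_write(shape, size_array, written):
    """Return the size array after a write to the size in connector.

    The write must cover every dimension of the shape; values written for
    dimensions without __dace_defer are ignored.
    """
    if len(written) != len(shape):
        raise ValueError(
            f"size write has {len(written)} entries, shape has {len(shape)}"
        )
    return [
        new if _DEFER_RE.search(str(expr)) else old
        for expr, old, new in zip(shape, size_array, written)
    ]


# The 999 for the non-deferred dimension 0 is ignored; dimension 1 becomes 5.
updated = apply_size_write(("2*N", "4*__dace_defer"), ["2*N", 0], [999, 5])
```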

The size of the array can be read from the _read_size out connector of an access node and used in maps.

Even though the size arrays are allocated on the stack, passing them to a GPU kernel as pointers would require a call to cudaMemcpy to transfer the size array to the GPU. To mitigate the issue (and have a more performant implementation than one that needs an extra memcpy before every kernel), the size array is unpacked into integers. The name mangling pattern is as follows: the current value of the DimensionId-th __dace_defer symbol of the array A is read from A_size[DimensionId] and mangled as __<ArrayName>_dim<DimensionId>_size. For example, the name mangled from A_size[1] is __A_dim1_size. The unpacked sizes of deferred symbols are passed as integers to the kernel. The _get_deferred_size_names function of the cpp module handles the generation of mangled names for array and offset accesses, and the CUDA codegen handles unpacking and instantiating the mangled (integer) variable names.
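The mangling scheme can be sketched in a few lines. Both helper names below are hypothetical and do not correspond to functions in the cpp module; only the resulting name format follows the example given above.

```python
def mangle_size_name(array_name: str, dim: int) -> str:
    """Name of the integer that carries <array_name>_size[<dim>] into a kernel."""
    return f"__{array_name}_dim{dim}_size"


def unpacked_kernel_args(array_name: str, deferred_dims):
    """Pair each mangled integer name with the host expression it is read from."""
    return [
        (mangle_size_name(array_name, d), f"{array_name}_size[{d}]")
        for d in deferred_dims
    ]


# For A[2*N, 4*__dace_defer] only dimension 1 is deferred.
args = unpacked_kernel_args("A", [1])
```

Passing plain integers this way avoids a device copy of the size array at the cost of one extra kernel argument per deferred dimension.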

My concerns:
Are the _size names misleading? Maybe I should use a name like _deferred_symbol_size instead?

ThrudPrimrose · 2024-12-12T10:02:07Z

The bold text was not copied over. I have a Google Doc with the same content:

https://docs.google.com/document/d/1fBinC5d0gpBnYD9C4M3e0zyxVkGD94LZFzzTCydM2sQ/edit?usp=sharing

ThrudPrimrose added 3 commits October 23, 2024 14:36

Early changes to support reallocation for CPU_Heap storage

b10475f

Minimal functioning realloc

fae0704

Add first prototype of deferred allocation support

023c86c

ThrudPrimrose requested review from tim0s and acalotoiu October 24, 2024 09:16

ThrudPrimrose added the no-ci Do not run any CI or actions for this PR label Oct 24, 2024

ThrudPrimrose added 14 commits October 29, 2024 16:31

Add reading the size of array, add size input as a special in connector

4aca5ee

Refactor

e1442f7

Do not rely on naming conventions but save the size array descriptor's name

dcbf2a2

Merge branch 'main' into deferred_allocation

33b9702

dace/sdfg/validation.py

e516985

Improve validation

925f8c7

More validation cases

93eae37

Add support for deferred allocation on GPU global arrays

5b55425

Non-transient support attempt 1

c783668

Improvements in GPU_Global support

dc81d69

Merge branch 'main' into deferred_allocation

400257d

Add tests

c14b91e

Change connector names

506d0aa

Add more test cases and fix some bugs

b956142

alexnick83 self-requested a review December 2, 2024 15:13

ThrudPrimrose added 5 commits December 3, 2024 11:56

Merge branch 'main' into deferred_allocation

c4eef0c

Bug fixes

82cdfde

More codegen fixes

97bc728

Split size and array storage

08cb50c

Major fixes regarding name changes etc.

ac90c86

ThrudPrimrose added 11 commits December 6, 2024 15:24

Various fixes

ee8a708

Fix validation case

f195e3f

Improve filtering for size arrays

e915607

Improve tests, improve deferred alloc check

9d646dc

Fix type check imports

3854c82

Improve validation and type checks and fix bugs

2408ad0

Build on top of the GPU codegen hack

62bc08c

Improve proposal according to PR comments, improve support for more complex shapes, add tests

f50382b

Add tests, refactor, improve size calculation

8c2f12d

Add array length checks to cutout test

ede2704

Refactor

a6163c0

ThrudPrimrose marked this pull request as ready for review December 11, 2024 16:39

ThrudPrimrose changed the title ~~[DRAFT] Deferred Allocation Prototype~~ Deferred Allocation Dec 11, 2024

ThrudPrimrose added 15 commits December 13, 2024 12:08

Merge branch 'main' into deferred_allocation

80f6b4a

Refactor and support CPU_Pinned

ae08459

Refactor and fix GPU array index generation

bb04e1a

Fixes to size desc name checks

02a48e8

Fix to erronous assertion

da7ba8d

Test script refactor

460b75b

Merge branch 'main' into deferred_allocation

0794638

merge fix

e0472dc

Allocate array fix

92717e1

Add forgotten defined var add

592336b

Merge branch 'main' into deferred_allocation

16f6e88

Make size arary alloc C99 std compliant instead of C++11

02937e3

Merge branch 'main' into deferred_allocation

b7e6125

Allow reshaping by changing size desc shape

b9698fa

Rm getters, move funcitonality to set shape only

9755810
