-
Error messages are often incomprehensible throughout scientific Python (not to highlight or single out Awkward by any means: NumPy, pandas, and Matplotlib all suffer from this at least as much), so I tend to doubt this would be much worse (without having tried it, of course).
-
My 2c: this is still better than CuPy's behavior. If we can reach deeper (I don't know how deeply Awkward plans to interface with CUDA), you can certainly check for exceptions during a non-blocking synchronization. (ref)
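For instance, a sketch assuming the call in question is a stream-completion query, which CuPy exposes as `Stream.done`:

```python
import cupy as cp

stream = cp.cuda.Stream(non_blocking=True)
with stream:
    y = cp.sum(cp.arange(10**7))  # kernels are enqueued, not yet finished

# Non-blocking query (cudaStreamQuery under the hood): reports whether the
# enqueued work has completed, without waiting for it.
if stream.done:
    print("finished:", y)
else:
    print("still running; errors would surface on a later sync")
```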
-
It sounds like people are in favor of error messages, and against following CuPy's lead in pretending they don't exist. I can get behind that. Our plans for CUDA are to have Awkward Arrays be manually copyable between main memory and GPU global memory, with all the high-level operations (slicing, `ak.*` functions, NumPy ufuncs, etc.) working the same on both backends. A script that works on the CPU could put `source = ak.to_backend(source, "cuda")` at the beginning and `sink = ak.to_backend(sink, "cpu")` at the end, and it would work the same.
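For example, with the `ak.to_backend` calls mentioned above (illustrative data):

```python
import awkward as ak

source = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
source = ak.to_backend(source, "cuda")  # copy to GPU global memory
result = source[source > 2.0]           # same high-level operations as on CPU
sink = ak.to_backend(result, "cpu")     # copy back to main memory at the end
```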
In addition to any CUDA errors ("GPU is unplugged! Plug it back in!"), the errors we're interested in are Awkward indexing errors, like the out-of-bounds slices above.

Oh! I just got what you're saying: there's a CUDA call (different from a blocking synchronize) that can check for errors without waiting for the queued kernels to finish. That certainly looks desirable, though it means that the error feedback could come even later. When step "C" is requested from the CPU and returns control to Python, step "B" (which has an error) might not have finished yet. The error flag could prevent "C" from doing any work: every device thread could start by checking the error flag and refusing to work if it's set, so steps "D", "E", and "F" would all quickly skip their work and the error would be reported only at the end.
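A sketch of that device-side flag check (not Awkward's actual kernels; the kernel and flag layout here are hypothetical), written as a CuPy `RawKernel`:

```python
import cupy as cp

step = cp.RawKernel(r'''
extern "C" __global__
void step(const double* in, double* out, int n, int* err) {
    if (*err != 0) return;  // an earlier step failed: refuse to do any work
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        if (in[i] < 0.0) {  // a data-dependent error discovered on the device
            atomicExch(err, 1);
            return;
        }
        out[i] = 2.0 * in[i];
    }
}
''', 'step')

n = 1024
data = cp.random.random(n)
out = cp.empty(n)
err = cp.zeros(1, dtype=cp.int32)            # shared error flag
step((n // 256,), (256,), (data, out, cp.int32(n), err))  # err stays 0 unless data is bad
```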
Revised question to everybody: what if errors are only raised when `sink = ak.to_backend(sink, "cpu")` is called? We can insert enough forensic information into the error-state structure to say which high-level operation (slicing, an `ak.*` function, a ufunc, etc.) went wrong, and possibly include a line number in the source code calling it.

In this model, error handling is strictly global, but it looks like CuPy has only one CUDA context, so we may be limited to that by external constraints (we use CuPy). It may be that multi-GPU handling is off the table, too: that would have to be implemented with multiple Python processes (such as through Dask). One thing that I like about raising GPU data-dependent errors in `ak.to_backend(sink, "cpu")` is that it gives users a single, well-defined place to expect them.
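A hypothetical shape for that forensic error state (the names are illustrative, not an actual Awkward structure):

```python
from dataclasses import dataclass

@dataclass
class ErrorState:
    failed: bool = False
    operation: str = ""   # e.g. "ak.flatten" or "slice by integer array"
    filename: str = ""    # source file of the calling line, if recorded
    lineno: int = 0

def check_on_copy_back(state: ErrorState) -> None:
    """In this sketch, called inside ak.to_backend(sink, "cpu")."""
    if state.failed:
        raise ValueError(
            f"GPU error in {state.operation} "
            f"(called at {state.filename}:{state.lineno})"
        )
```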
-
I meant to answer this question, too, but got distracted by other things. There are three plans for Awkward-CUDA integration:
-
@swishdiff has started developing infrastructure to perform Awkward Array calculations on GPUs. In doing this, we're facing some questions that would have implications for users. One of these deals with concurrency.
When you launch a CUDA kernel in C++, the kernel runs asynchronously, potentially returning control to the driving C++ program before the calculation is complete. But Awkward Array operations are eager: they finish calculating before returning control to Python. (dask-awkward is another story.) Since the whole point of a GPU backend is better utilization of resources, we may want to adopt the asynchronous model.
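A small CuPy illustration of that asynchrony (timings are machine-dependent):

```python
import cupy as cp

x = cp.random.random((4096, 4096))
y = x @ x                        # launches a GEMM kernel and returns at once
print("back in Python already")  # typically prints before the GPU finishes
cp.cuda.Device().synchronize()   # block until all queued GPU work completes
```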
Each Awkward operation (`ak.*` functions, NumPy ufuncs, slicing, etc.) assumes that the values in an array are valid, so at minimum there needs to be a `cudaSynchronize` between each operation (and also between the intermediate steps that comprise a high-level operation). But if we put the `cudaSynchronize` at the beginning of each Awkward operation, we can get strictly greater utilization than if we put the `cudaSynchronize` at the end of each operation: the "CPU" can be returning to Python and doing Python stuff while the "GPU" is doing numerical calculations.
However, there's a consequence: if control is returned to Python before the numerical calculation completes, then any errors deriving from data in the arrays can't be raised as Python exceptions. If, for instance, you're trying to broadcast together two arrays that have different lengths, you'll get an error about that (the length is represented in Python, on the CPU), but if they have the same outer length and incompatible internal lengths, you won't get an error. That includes things like taking the first muon in each event (something like `events.muons[:, 0]`) when some events might have zero muons.
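For instance (illustrative data; the CPU backend raises here):

```python
import awkward as ak

events = ak.Array([{"muons": [25.3, 17.9]}, {"muons": []}])
events.muons[:, 0]  # raises: the second event has no muon at index 0
```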
CuPy also returns control to Python before the GPU calculation completes, so I wondered how CuPy handles this. Here's an example of an operation that can only raise an error if you look at the data in the array: slicing by an array of integer indexes, some of which are out of bounds.
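A minimal reconstruction of that kind of example (values chosen so that 5.5 is the first element, matching the result quoted below):

```python
import numpy as np

a = np.array([5.5, 6.6, 7.7, 8.8, 9.9])
idx = np.array([0, 2, 4, 10])  # 10 is out of bounds for an array of length 5
a[idx]  # IndexError: index 10 is out of bounds for axis 0 with size 5
```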
Okay, NumPy raises an error if the slice is wrong. What does CuPy do?
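The same slice with CuPy (a sketch; the wrap-around is CuPy's documented out-of-bounds behavior):

```python
import cupy as cp

a = cp.array([5.5, 6.6, 7.7, 8.8, 9.9])
idx = cp.array([0, 2, 4, 10])
a[idx]  # array([5.5, 7.7, 9.9, 5.5]): index 10 wrapped around to 10 % 5 == 0
```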
It does not raise an error! The last index is beyond the bounds of the array, but it evidently "wrapped around," returning 5.5, a value from the beginning. This is on purpose and documented, presumably because of exactly this trade-off: CuPy can either return control to Python before it finishes processing or it can detect the error, but not both.
We could adopt a similar policy, but there are more ways that Awkward operations can encounter errors in the midst of processing. How do people feel about the possibility of running calculations that suppress errors, doing some arbitrary thing like CuPy's wrap-around where the CPU-based calculation would raise an error?
Another possibility that @swishdiff and I discussed is to set a flag and raise a Python exception when you try to do the next operation. That is, if you compute A, B, and C, and B is invalid, you get the error message when C begins. It sounds like that could make debugging difficult, but we could put the name of operation B (e.g. `ak.this` or `ak.that`) in the error message. A minimal sketch of that idea is below.

What does everyone think?
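A pure-Python sketch of that flag-and-raise-later control flow (no real GPU calls; `gpu_op` is a stand-in):

```python
error_flag = {"op": None}  # global error state shared by all operations

def gpu_op(name, ok=True):
    # First, report any error left behind by an earlier operation.
    if error_flag["op"] is not None:
        failed, error_flag["op"] = error_flag["op"], None
        raise RuntimeError(f"error detected in earlier operation {failed!r}")
    # Pretend to launch kernels asynchronously; a data-dependent failure
    # only sets the flag, because control has already returned to Python.
    if not ok:
        error_flag["op"] = name

gpu_op("A")
gpu_op("B", ok=False)  # B is invalid, but no exception is raised yet
gpu_op("C")            # raises: error detected in earlier operation 'B'
```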