This repository has been archived by the owner on Nov 27, 2024. It is now read-only.

Use MirroredArray instead of numpy arrays to reduce boilerplate #671

Closed
wants to merge 14 commits

Conversation

connorjward
Collaborator

@kaushikcfd

@JDBetteridge and I took a harder look at your code yesterday and came up with an idea that we think could reduce a large amount of boilerplate.

The key idea is the introduction of what I have called a MirroredArray. It is effectively a host/device 'aware' version of a numpy array. It would have a few responsibilities:

  • Ensure that the data on the host/device is up-to-date when needed
  • Do the correct transformation to a PETSc Vec depending on the backend
  • Return an appropriate pointer for passing to the kernel (_kernel_args_), depending on whether or not offloading is enabled

I think this would remove a lot of boilerplate because it would eliminate the need for us to have separate implementations of Dat, ExtrudedSet, Global, etc. per backend. All that would need to be done instead is to replace any numpy arrays that we want to exist on both host and device with MirroredArrays.

What I've provided here is quite a rough sketch of what I think such a solution would look like. The key thing I have not yet tackled is how we might transform these arrays into PETSc Vecs with context managers and such.
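
To give a flavour of the interface I have in mind, here is a minimal sketch (the availability flags follow the PR, but names and details are illustrative, not final):

import numpy as np

# availability states for the two copies of the data
ON_HOST, ON_DEVICE, ON_BOTH = 0, 1, 2

class MirroredArray:
    """A host/device 'aware' drop-in for a numpy array."""

    def __init__(self, data, dtype, shape):
        if data is None:
            self._host_data = np.zeros(shape, dtype=dtype)
        else:
            self._host_data = np.asarray(data, dtype=dtype).reshape(shape)
        self.availability = ON_HOST

    @classmethod
    def new(cls, data, dtype, shape):
        # dispatch to the backend-specific subclass (numpy, OpenCL, CUDA)
        # based on the global configuration
        raise NotImplementedError

    def ensure_availability_on_host(self):
        if self.availability == ON_DEVICE:
            self.device_to_host_copy()
            self.availability = ON_BOTH

    def ensure_availability_on_device(self):
        if self.availability == ON_HOST:
            self.host_to_device_copy()
            self.availability = ON_BOTH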

Please let me know your thoughts. @JDBetteridge and I would be very happy to have a call with you at some point to discuss it further.

Comment on lines +136 to +145
if configuration["backend"] == "OPENCL":
    # TODO: Instruct the user to pass
    #     -viennacl_backend opencl
    #     -viennacl_opencl_device_type gpu
    # create a dummy vector and extract its associated command queue
    x = PETSc.Vec().create(PETSc.COMM_WORLD)
    x.setType("viennacl")
    x.setSizes(size=1)
    queue_ptr = x.getCLQueueHandle()
    cl_queue = pyopencl.CommandQueue.from_int_ptr(queue_ptr, retain=False)
Collaborator Author

This was just a hack to create a global queue object since I think a compute_backend object may not be required.

self.device_to_host_copy()

@property
def vec(self):
Collaborator Author

Subclasses would need to implement the right thing here. I have not tried to make my code work for can_be_represented_as_petscvec but I don't think it would require a radical rethink.
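
For example, a host-backed subclass might do something like the following (a sketch only; it assumes the array knows its communicator, and createWithArray wraps the numpy buffer without copying):

@property
def vec(self):
    # make sure the host copy is current before handing it to PETSc
    self.ensure_availability_on_host()
    return PETSc.Vec().createWithArray(self._host_data, comm=self.comm)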

(iterset.total_size, arity), allow_none=True)
-        self.shape = (iterset.total_size, arity)
+        shape = (iterset.total_size, arity)
+        self._values_array = MirroredArray.new(values, dtypes.IntType, shape)
Collaborator Author

Note how if we do this then backends.opencl.Map can go away.

Contributor

@kaushikcfd kaushikcfd left a comment

Thanks! Some of the logic here would force unnecessary host<->device copies, making it hard to evaluate the change purely from the decrease in LOC. I agree that there is some duplication between the OpenCL and CUDA backends, but I'm not sure whether this MirroredArray abstraction is the way to go.

(Yep we should schedule a call!)

    self._host_data = np.zeros(shape, dtype=dtype)
else:
    self._host_data = verify_reshape(data, dtype, shape)
self.availability = ON_BOTH
Contributor

Why is the availability ON_BOTH here?

Collaborator Author

This is probably wrong if data is not None. For simplicity I was assuming that the array would be initialised with a valid copy on both host and device. This definitely doesn't need to be the case (and I doubt I have implemented it correctly anyway).
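
Something like this would probably be more accurate (a sketch, reusing the flags from the earlier snippet):

if data is None:
    self._host_data = np.zeros(shape, dtype=dtype)
else:
    self._host_data = verify_reshape(data, dtype, shape)
# either way, only the host copy is valid until the device is written
self.availability = ON_HOST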

Comment on lines +48 to +49
# lazy but for now assume that the data is always modified if we access
# the pointer
Contributor

I agree we should do this for now. But we could probably patch PyOP2 in the future so that a ParLoop tells us the access type.

Member

The parloop does tell you the access type? Otherwise nothing would work.

Collaborator Author

A simple solution here would be to replace array.kernel_arg with array.get_kernel_arg(access), which would return either array.{host,device}_ptr_ro or array.{host,device}_ptr as appropriate. I've actually commented this out below.
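
Roughly like this (a sketch; READ and the offloading flag stand in for the real access descriptors and configuration):

def get_kernel_arg(self, access):
    # read-only access does not dirty the array, so hand out the _ro pointer
    if offloading:
        return self.device_ptr_ro if access is READ else self.device_ptr
    else:
        return self.host_ptr_ro if access is READ else self.host_ptr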

Comment on lines +136 to +145
if configuration["backend"] == "OPENCL":
    # TODO: Instruct the user to pass
    #     -viennacl_backend opencl
    #     -viennacl_opencl_device_type gpu
    # create a dummy vector and extract its associated command queue
    x = PETSc.Vec().create(PETSc.COMM_WORLD)
    x.setType("viennacl")
    x.setSizes(size=1)
    queue_ptr = x.getCLQueueHandle()
    cl_queue = pyopencl.CommandQueue.from_int_ptr(queue_ptr, retain=False)
Contributor

I'm assuming this would go away.

Collaborator Author

Possibly. What I'm trying to illustrate here is that we can do without the OpenCLBackend class since we don't need to maintain different Dat, Global, etc subclasses.

Comment on lines +94 to +98
self.ensure_availability_on_host()
self.availability = ON_HOST
v = self._host_data.view()
v.setflags(write=True)
return v
Contributor

This would force a device->host copy, which isn't necessary and would be performance limiting.

Collaborator Author

I think a device -> host copy is needed here as otherwise the numpy array you get back might be wrong?

Contributor

IMO we should return the array on the device (cl.array.Array, pycuda.GPUArray) whenever we are in the offloading context. These allow us to perform numpy-like operations on the device, so there's no good reason to return the numpy array. Wdyt?

Collaborator Author

That makes a lot of sense. I was assuming here that every time we called this function we would want to inspect the data, which would require a copy to the host.
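
So the accessor might become something like this (a sketch; the offloading flag is illustrative):

@property
def data(self):
    if offloading:
        # stay on the device and hand back the cl/cuda array directly
        self.ensure_availability_on_device()
        self.availability = ON_DEVICE
        return self._device_data
    else:
        self.ensure_availability_on_host()
        self.availability = ON_HOST
        v = self._host_data.view()
        v.setflags(write=True)
        return v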

Comment on lines +151 to +154
if data is None:
    self._device_data = pyopencl.array.empty(cl_queue, shape, dtype)
else:
    self._device_data = pyopencl.array.to_device(cl_queue, data)
Contributor

[Minor]: This is probably missing a super().__init__(data, shape, dtype).

Collaborator Author

Yep. Good spot.
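
i.e. presumably:

def __init__(self, data, shape, dtype):
    # initialise the host side and the availability flag in the base class
    super().__init__(data, shape, dtype)
    if data is None:
        self._device_data = pyopencl.array.empty(cl_queue, shape, dtype)
    else:
        self._device_data = pyopencl.array.to_device(cl_queue, data)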

@@ -162,11 +163,11 @@ def reduction_begin(self):
     MAX: mpi.MPI.MAX}.get(self.accesses[idx])

     if mpi.MPI.VERSION >= 3:
-        requests.append(self.comm.Iallreduce(glob._data,
+        requests.append(self.comm.Iallreduce(glob.data,
Contributor

I'm not convinced this is correct. glob.data shouldn't necessarily return an array on the host.

Collaborator Author

I thought that MPI routines operate on the host copies. Is that not the case?

Contributor

We might want to resolve #671 (comment) first as we are going back and forth on what Global.data should actually return.

Comment on lines +167 to +171
def host_to_device_copy(self):
    self._device_data.set(self._host_data)

def device_to_host_copy(self):
    # pyopencl's Array.get takes the destination array via the ary keyword
    # (the first positional argument is the queue)
    self._device_data.get(ary=self._host_data)
Contributor

The logic here would be different based on whether this is a Dat or a Map/Global. For Dats we need to handle the special case where we don't want to synchronize the halo values, as pointed out by Lawrence in #574 (comment).

Collaborator Author

I suppose. Perhaps a MirroredArrayWithHalo would be the way to go.
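
e.g. something like this (a sketch; _nowned stands in for however the owned/halo split ends up being recorded):

class MirroredArrayWithHalo(MirroredArray):
    def device_to_host_copy(self):
        # only synchronise the owned entries; halo values are exchanged
        # separately via the usual halo exchange
        self._host_data[:self._nowned] = self._device_data[:self._nowned].get()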

@kaushikcfd force-pushed the gpu branch 3 times, most recently from 5bed614 to 3df2554 on November 18, 2022 00:16
@connorjward
Collaborator Author

Closing as superseded by #691.
