This repository has been archived by the owner on Nov 27, 2024. It is now read-only.

Use MirroredArray instead of numpy arrays to reduce boilerplate #671

Closed
wants to merge 14 commits

Conversation

connorjward
Collaborator

@kaushikcfd

@JDBetteridge and I took a harder look at your code yesterday and came up with an idea that we think could reduce a large amount of boilerplate.

The key idea is the introduction of what I have called a MirroredArray. It is effectively a host/device 'aware' version of a numpy array. It would have a few responsibilities:

  • Ensure that the data on the host/device is up-to-date when needed
  • Do the correct transformation to a PETSc Vec depending on the backend
  • Return an appropriate pointer for passing to the kernel (_kernel_args_), depending on whether or not offloading is enabled

I think this would remove a lot of boilerplate because it would eliminate the need for us to have separate implementations of Dat, ExtrudedSet, Global, etc. per backend. All that would need to be done instead is to replace any numpy arrays that we want to exist on both host and device with MirroredArrays.

What I've provided here is quite a rough sketch of what I think such a solution would look like. The key thing I have not yet tackled is how we might transform these arrays into PETSc Vecs with context managers and such.
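
To give a flavour of the interface I have in mind, here is a minimal sketch (the availability flags follow the PR, but names and details are illustrative, not final):

import numpy as np

# availability states for the two copies of the data
ON_HOST, ON_DEVICE, ON_BOTH = 0, 1, 2

class MirroredArray:
    """A host/device 'aware' drop-in for a numpy array."""

    def __init__(self, data, dtype, shape):
        if data is None:
            self._host_data = np.zeros(shape, dtype=dtype)
        else:
            self._host_data = np.asarray(data, dtype=dtype).reshape(shape)
        self.availability = ON_HOST

    @classmethod
    def new(cls, data, dtype, shape):
        # dispatch to the backend-specific subclass (numpy, OpenCL, CUDA)
        # based on the global configuration
        raise NotImplementedError

    def ensure_availability_on_host(self):
        if self.availability == ON_DEVICE:
            self.device_to_host_copy()
            self.availability = ON_BOTH

    def ensure_availability_on_device(self):
        if self.availability == ON_HOST:
            self.host_to_device_copy()
            self.availability = ON_BOTH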

Please let me know your thoughts. @JDBetteridge and I would be very happy to have a call with you at some point to discuss it further.

Comment on lines +136 to +145
if configuration["backend"] == "OPENCL":
    # TODO: Instruct the user to pass
    #     -viennacl_backend opencl
    #     -viennacl_opencl_device_type gpu
    # create a dummy vector and extract its associated command queue
    x = PETSc.Vec().create(PETSc.COMM_WORLD)
    x.setType("viennacl")
    x.setSizes(size=1)
    queue_ptr = x.getCLQueueHandle()
    cl_queue = pyopencl.CommandQueue.from_int_ptr(queue_ptr, retain=False)
Collaborator Author

This was just a hack to create a global queue object since I think a compute_backend object may not be required.

self.device_to_host_copy()

@property
def vec(self):
Collaborator Author

Subclasses would need to implement the right thing here. I have not tried to make my code work for can_be_represented_as_petscvec but I don't think it would require a radical rethink.
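
For example, a host-backed subclass might do something like the following (a sketch only; it assumes the array knows its communicator, and createWithArray wraps the numpy buffer without copying):

@property
def vec(self):
    # make sure the host copy is current before handing it to PETSc
    self.ensure_availability_on_host()
    return PETSc.Vec().createWithArray(self._host_data, comm=self.comm)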

(iterset.total_size, arity), allow_none=True)
-        self.shape = (iterset.total_size, arity)
+        shape = (iterset.total_size, arity)
+        self._values_array = MirroredArray.new(values, dtypes.IntType, shape)
Collaborator Author

Note how if we do this then backends.opencl.Map can go away.

Contributor

@kaushikcfd kaushikcfd left a comment

Thanks! Some of the logic here would force unnecessary host<->device copies, making it hard to evaluate the change purely from the decrease in LOC. I agree that there is some duplication between the OpenCL and CUDA backends, but I'm not sure whether this MirroredArray abstraction is the way to go.

(Yep we should schedule a call!)

    self._host_data = np.zeros(shape, dtype=dtype)
else:
    self._host_data = verify_reshape(data, dtype, shape)
self.availability = ON_BOTH
Contributor

Why is the availability ON_BOTH here?

Collaborator Author

This is probably wrong if data is not None. For simplicity I was assuming that the array would be initialised with a valid copy on both host and device. This definitely doesn't need to be the case (and I doubt I have implemented it correctly anyway).
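
Something like this would probably be more accurate (a sketch, reusing the flags from the earlier snippet):

if data is None:
    self._host_data = np.zeros(shape, dtype=dtype)
else:
    self._host_data = verify_reshape(data, dtype, shape)
# either way, only the host copy is valid until the device is written
self.availability = ON_HOST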

Comment on lines +48 to +49
# lazy but for now assume that the data is always modified if we access
# the pointer
Contributor

I agree we should do this for now. But we could probably patch PyOP2 in the future so that a ParLoop tells us the access type.

Member

The parloop does tell you the access type? Otherwise nothing would work.

Collaborator Author

A simple solution here would be to replace array.kernel_arg with array.get_kernel_arg(access), which would return either array.{host,device}_ptr_ro or array.{host,device}_ptr as appropriate. I've actually commented this out below.
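
Roughly like this (a sketch; READ and the offloading flag stand in for the real access descriptors and configuration):

def get_kernel_arg(self, access):
    # read-only access does not dirty the array, so hand out the _ro pointer
    if offloading:
        return self.device_ptr_ro if access is READ else self.device_ptr
    else:
        return self.host_ptr_ro if access is READ else self.host_ptr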

Comment on lines +136 to +145
if configuration["backend"] == "OPENCL":
    # TODO: Instruct the user to pass
    #     -viennacl_backend opencl
    #     -viennacl_opencl_device_type gpu
    # create a dummy vector and extract its associated command queue
    x = PETSc.Vec().create(PETSc.COMM_WORLD)
    x.setType("viennacl")
    x.setSizes(size=1)
    queue_ptr = x.getCLQueueHandle()
    cl_queue = pyopencl.CommandQueue.from_int_ptr(queue_ptr, retain=False)
Contributor

I'm assuming this would go away.

Collaborator Author

Possibly. What I'm trying to illustrate here is that we can do without the OpenCLBackend class since we don't need to maintain different Dat, Global, etc subclasses.

Comment on lines +94 to +98
self.ensure_availability_on_host()
self.availability = ON_HOST
v = self._host_data.view()
v.setflags(write=True)
return v
Contributor

This would force a device->host copy, which isn't necessary and would be performance limiting.

Collaborator Author

I think a device -> host copy is needed here as otherwise the numpy array you get back might be wrong?

Contributor

IMO we should return the array on the device (cl.array.Array, pycuda.GPUArray) whenever we are in the offloading context. These allow us to perform numpy-like operations on the device, so there's no good reason to return the numpy array. Wdyt?

Collaborator Author

That makes a lot of sense. I was assuming here that every time we called this function we would want to inspect the data, which would require a copy to the host.
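
So the accessor might become something like this (a sketch; the offloading flag is illustrative):

@property
def data(self):
    if offloading:
        # stay on the device and hand back the cl/cuda array directly
        self.ensure_availability_on_device()
        self.availability = ON_DEVICE
        return self._device_data
    else:
        self.ensure_availability_on_host()
        self.availability = ON_HOST
        v = self._host_data.view()
        v.setflags(write=True)
        return v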

Comment on lines +151 to +154
if data is None:
    self._device_data = pyopencl.array.empty(cl_queue, shape, dtype)
else:
    self._device_data = pyopencl.array.to_device(cl_queue, data)
Contributor

[Minor]: This is probably missing a super().__init__(data, shape, dtype).

Collaborator Author

Yep. Good spot.
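
i.e. presumably:

def __init__(self, data, shape, dtype):
    # initialise the host side and the availability flag in the base class
    super().__init__(data, shape, dtype)
    if data is None:
        self._device_data = pyopencl.array.empty(cl_queue, shape, dtype)
    else:
        self._device_data = pyopencl.array.to_device(cl_queue, data)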

@@ -162,11 +163,11 @@ def reduction_begin(self):
     MAX: mpi.MPI.MAX}.get(self.accesses[idx])

     if mpi.MPI.VERSION >= 3:
-        requests.append(self.comm.Iallreduce(glob._data,
+        requests.append(self.comm.Iallreduce(glob.data,
Contributor

I'm not convinced this is correct. glob.data shouldn't necessarily return an array on the host.

Collaborator Author

I thought that MPI routines operate on the host copies. Is that not the case?

Contributor

We might want to resolve #671 (comment) first as we are going back and forth on what Global.data should actually return.

Comment on lines +167 to +171
def host_to_device_copy(self):
    self._device_data.set(self._host_data)

def device_to_host_copy(self):
    # pyopencl's Array.get takes the destination array via the ary keyword
    # (the first positional argument is the queue)
    self._device_data.get(ary=self._host_data)
Contributor

The logic here would be different based on whether this is a Dat or a Map/Global. For Dats we need to handle the special case where we don't want to synchronize the halo values, as pointed out by Lawrence in #574 (comment).

Collaborator Author

I suppose. Perhaps a MirroredArrayWithHalo would be the way to go.
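
e.g. something like this (a sketch; _nowned stands in for however the owned/halo split ends up being recorded):

class MirroredArrayWithHalo(MirroredArray):
    def device_to_host_copy(self):
        # only synchronise the owned entries; halo values are exchanged
        # separately via the usual halo exchange
        self._host_data[:self._nowned] = self._device_data[:self._nowned].get()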

@kaushikcfd force-pushed the gpu branch 3 times, most recently from 5bed614 to 3df2554 on November 18, 2022 00:16
@connorjward
Collaborator Author

Closing as superseded by #691.
