Use MirroredArray instead of numpy arrays to reduce boilerplate #671
Conversation
if configuration["backend"] == "OPENCL":
    # TODO: Instruct the user to pass
    #   -viennacl_backend opencl
    #   -viennacl_opencl_device_type gpu
    # create a dummy vector and extract its associated command queue
    x = PETSc.Vec().create(PETSc.COMM_WORLD)
    x.setType("viennacl")
    x.setSizes(size=1)
    queue_ptr = x.getCLQueueHandle()
    cl_queue = pyopencl.CommandQueue.from_int_ptr(queue_ptr, retain=False)
This was just a hack to create a global queue object, since I think a compute_backend object may not be required.
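For illustration, here is one way that hack could be wrapped in a lazily-initialised, process-global helper. This is only a sketch: the helper name get_global_cl_queue is made up, and the PETSc/ViennaCL calls simply repeat the ones quoted above.

import pyopencl
from petsc4py import PETSc

_cl_queue = None

def get_global_cl_queue():
    # Create the global command queue on first use by extracting it from a
    # dummy ViennaCL vector, exactly as in the hunk above.
    global _cl_queue
    if _cl_queue is None:
        x = PETSc.Vec().create(PETSc.COMM_WORLD)
        x.setType("viennacl")
        x.setSizes(size=1)
        queue_ptr = x.getCLQueueHandle()
        _cl_queue = pyopencl.CommandQueue.from_int_ptr(queue_ptr, retain=False)
    return _cl_queue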
        self.device_to_host_copy()

    @property
    def vec(self):
Subclasses would need to implement the right thing here. I have not tried to make my code work for can_be_represented_as_petscvec, but I don't think it would require a radical rethink.
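As a concrete (untested) sketch of what "the right thing" could look like for a purely host-backed array, assuming petsc4py's Vec.createWithArray and the attribute names used elsewhere in this PR:

import numpy as np
from petsc4py import PETSc

class HostOnlyMirroredArray:
    # Illustrative host-only variant; attribute names mirror this PR.
    def __init__(self, data, dtype):
        self._host_data = np.asarray(data, dtype=dtype)

    def ensure_availability_on_host(self):
        pass  # the data already lives on the host

    @property
    def vec(self):
        self.ensure_availability_on_host()
        # Wrap the flattened host buffer in a PETSc Vec without copying.
        return PETSc.Vec().createWithArray(self._host_data.reshape(-1),
                                           comm=PETSc.COMM_WORLD)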
                       (iterset.total_size, arity), allow_none=True)
        self.shape = (iterset.total_size, arity)
        shape = (iterset.total_size, arity)
        self._values_array = MirroredArray.new(values, dtypes.IntType, shape)
Note how, if we do this, then backends.opencl.Map can go away.
Thanks! Some of the logic here would force unnecessary host<->device copies, making it hard to evaluate purely from the decrease in the LOC. I agree that there is some duplication between the OpenCL and CUDA backends, but I'm not sure whether this MirroredArray abstraction is the way to go.
(Yep, we should schedule a call!)
            self._host_data = np.zeros(shape, dtype=dtype)
        else:
            self._host_data = verify_reshape(data, dtype, shape)
        self.availability = ON_BOTH
Why is the availability ON_BOTH here?
This is probably wrong if data is not None. For simplicity I was assuming that the array would be initialised with a valid copy on both host and device. This definitely doesn't need to be the case (and I doubt I have implemented it correctly anyway).
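One way to make the intent explicit is to treat the host as the authoritative copy at construction time and only flip to ON_BOTH after an actual transfer. This is only a sketch: the ON_* constants follow this PR, while their string values and everything else here are illustrative.

import numpy as np

ON_HOST, ON_DEVICE, ON_BOTH = "host", "device", "both"  # placeholder values

class MirroredArraySketch:
    def __init__(self, data, shape, dtype):
        if data is None:
            self._host_data = np.zeros(shape, dtype=dtype)
        else:
            self._host_data = np.asarray(data, dtype=dtype).reshape(shape)
        # The freshest copy lives on the host until a host->device copy runs;
        # only then would availability become ON_BOTH.
        self.availability = ON_HOST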
        # lazy but for now assume that the data is always modified if we access
        # the pointer
I agree we should do this right now. But we could probably patch PyOP2 in the future so that a ParLoop tells us the access type.
The parloop does tell you the access type? Otherwise nothing would work.
A simple solution here would be to replace array.kernel_arg with array.get_kernel_arg(access), which would return either array.{host,device}_ptr_ro or array.{host,device}_ptr as appropriate. I've actually commented this out below.
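Sketched out, that replacement might look like the following. The pointer properties follow the naming in the comment above, and offloading and the READ access descriptor are assumed to be available from PyOP2; none of this is tested code.

def get_kernel_arg(self, access):
    # Return the pointer handed to the generated kernel for this access mode.
    read_only = access is READ
    if offloading:
        return self.device_ptr_ro if read_only else self.device_ptr
    return self.host_ptr_ro if read_only else self.host_ptr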
if configuration["backend"] == "OPENCL":
    # TODO: Instruct the user to pass
    #   -viennacl_backend opencl
    #   -viennacl_opencl_device_type gpu
    # create a dummy vector and extract its associated command queue
    x = PETSc.Vec().create(PETSc.COMM_WORLD)
    x.setType("viennacl")
    x.setSizes(size=1)
    queue_ptr = x.getCLQueueHandle()
    cl_queue = pyopencl.CommandQueue.from_int_ptr(queue_ptr, retain=False)
I'm assuming this would go away.
Possibly. What I'm trying to illustrate here is that we can do without the OpenCLBackend class, since we don't need to maintain different Dat, Global, etc. subclasses.
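To make that concrete, here is a deliberately simplified sketch of a single, backend-agnostic Dat. MirroredArray.new follows this PR; the rest is illustrative.

class Dat:
    # One class for every backend: all host/device bookkeeping is delegated
    # to the MirroredArray it owns.
    def __init__(self, shape, data=None, dtype=None):
        self._array = MirroredArray.new(data, dtype, shape)

    @property
    def kernel_arg(self):
        # Host or device pointer, chosen by the array itself.
        return self._array.kernel_arg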
        self.ensure_availability_on_host()
        self.availability = ON_HOST
        v = self._host_data.view()
        v.setflags(write=True)
        return v
This would force a device->host copy, which isn't necessary and would be performance limiting.
I think a device -> host copy is needed here as otherwise the numpy array you get back might be wrong?
IMO we should return the array on the device (cl.array.Array, pycuda.GPUArray) whenever we are in the offloading context. They allow us to perform numpy-like operations on the device, so IMO there is no good reason to return the numpy array. Wdyt?
That makes a lot of sense. I was assuming here that every time we called this function we would want to inspect the data, which would require a copy to the host.
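A possible compromise, sketched under the assumptions of this PR (the offloading flag, the ensure_availability_on_* helpers and the ON_* constants), would be to hand back the device array inside the offloading context and only copy back to a numpy array otherwise.

@property
def data(self):
    if offloading:
        # Stay on the device and return the pyopencl/pycuda array directly.
        self.ensure_availability_on_device()
        self.availability = ON_DEVICE  # assume the caller may write to it
        return self._device_data
    else:
        self.ensure_availability_on_host()
        self.availability = ON_HOST
        v = self._host_data.view()
        v.setflags(write=True)
        return v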
        if data is None:
            self._device_data = pyopencl.array.empty(cl_queue, shape, dtype)
        else:
            self._device_data = pyopencl.array.to_device(cl_queue, data)
[Minor]: This is probably missing a super().__init__(data, shape, dtype).
Yep. Good spot.
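For completeness, the fixed constructor would presumably look something like this (a sketch only; cl_queue is the module-level queue discussed earlier in the thread, and MirroredArray is the base class from this PR).

import pyopencl.array

class OpenCLMirroredArray(MirroredArray):
    def __init__(self, data, shape, dtype):
        super().__init__(data, shape, dtype)  # the previously missing call
        if data is None:
            self._device_data = pyopencl.array.empty(cl_queue, shape, dtype)
        else:
            self._device_data = pyopencl.array.to_device(cl_queue, data)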
@@ -162,11 +163,11 @@ def reduction_begin(self):
             MAX: mpi.MPI.MAX}.get(self.accesses[idx])

         if mpi.MPI.VERSION >= 3:
-            requests.append(self.comm.Iallreduce(glob._data,
+            requests.append(self.comm.Iallreduce(glob.data,
I'm not convinced this is correct. glob.data shouldn't necessarily return an array on the host.
I thought that MPI routines operated on host copies. Is that not the case?
We might want to resolve #671 (comment) first, as we are going back and forth on what Global.data should actually return.
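If glob.data stops guaranteeing a host array, the reduction would need to ask for one explicitly. A rough sketch, assuming the MirroredArray helpers from this PR plus a hypothetical host_data attribute (this is not what the PR currently does):

def reduction_begin_sketch(self, glob, op):
    # Make sure the freshest copy is on the host before handing a numpy
    # buffer to MPI (unless a GPU-aware MPI path is added later).
    glob._array.ensure_availability_on_host()
    return self.comm.Iallreduce(mpi.MPI.IN_PLACE, glob._array.host_data, op=op)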
    def host_to_device_copy(self):
        self._device_data.set(self._host_data)

    def device_to_host_copy(self):
        self._device_data.get(self._host_data)
The logic here would be different based on whether this is a Dat or a (Map|Global). For Dats we need to handle the special case where we don't want to synchronise the halo values, as pointed out by Lawrence in #574 (comment).
I suppose. Perhaps a MirroredArrayWithHalo would be the way to go.
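To illustrate what such a class could look like, here is a very rough sketch. The owned-size bookkeeping and every name below are invented for illustration; the real owned/halo split would follow Lawrence's comment in #574.

class MirroredArrayWithHalo(OpenCLMirroredArray):
    def __init__(self, data, shape, dtype, owned_size):
        super().__init__(data, shape, dtype)
        self._owned_size = owned_size  # number of owned (non-halo) rows

    def device_to_host_copy(self, include_halo=True):
        if include_halo:
            self._device_data.get(ary=self._host_data)
        else:
            # Only bring back the owned rows; halo rows are refreshed by the
            # usual halo exchange instead.
            self._host_data[:self._owned_size] = \
                self._device_data[:self._owned_size].get()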
Force-pushed from 5bed614 to 3df2554.
Closing as superseded by #691.
@kaushikcfd
@JDBetteridge and I took a harder look at your code yesterday and came up with an idea that we think could reduce a large amount of boilerplate.

The key idea is the introduction of what I have called a MirroredArray. It is effectively a host/device 'aware' version of a numpy array. It would have a few responsibilities, including returning the right pointer (e.g. in _kernel_args_), depending on whether or not offloading is enabled.

I think this would be good for removing lots of boilerplate because it would (I think) remove the need for us to have separate implementations of Dat, ExtrudedSet, Global, etc. per backend. All that would need to be done instead is to replace any numpy arrays that we want to exist on both host and device with MirroredArrays.

What I've provided here is quite a rough sketch of what I think such a solution would look like. The key thing I have not yet tackled is how we might transform these arrays into PETSc Vecs with context managers and such.

Please let me know your thoughts. @JDBetteridge and I would be very happy to have a call with you at some point to discuss it further.
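To make the proposal concrete, here is a minimal sketch of the interface such a class might expose. The method and attribute names follow the ones appearing in the diff hunks above; the bodies are illustrative only.

import numpy as np

class MirroredArray:
    # Host/device 'aware' array: subclasses add a device buffer and override
    # the ensure_availability_* hooks.

    @classmethod
    def new(cls, data, dtype, shape):
        # In the PR this would dispatch on configuration["backend"]; here we
        # always build the host-only base class.
        return cls(data, shape, dtype)

    def __init__(self, data, shape, dtype):
        if data is None:
            self._host_data = np.zeros(shape, dtype=dtype)
        else:
            self._host_data = np.asarray(data, dtype=dtype).reshape(shape)

    def ensure_availability_on_host(self):
        pass  # subclasses copy device -> host here if the host copy is stale

    def ensure_availability_on_device(self):
        pass  # subclasses copy host -> device here if the device copy is stale

    @property
    def kernel_arg(self):
        # Subclasses return a device pointer when offloading is enabled.
        return self._host_data.ctypes.data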