0-dimensional tensor not composable #770

Closed
fschlimb opened this issue May 10, 2021 · 18 comments · Fixed by #758

@fschlimb
Contributor

A simple program like

import heat as ht
a = ht.ones(128, split=0)
a = a + a[0]

gives an error message. Re-balancing the 0-dimensional tensor a[0] as suggested in a warning is neither useful nor possible (same error).

I know I can cast it to a scalar, but that's not very convenient if I want to implement a generic feature.

What is needed to make the above work?

@coquelin77
Member

It would be possible to change the behavior of getitem to distribute the single item to all processes. Typically, when the shape is reduced to a single value along a dimension, the split dimension is reduced by one. Perhaps, in the case that the result is reduced to a single value, it could be distributed with split=None. What do you think @ClaudiaComito?

@ClaudiaComito
Contributor

Thanks @fschlimb and @coquelin77, indeed I already implemented this in #758, i.e. with a[0] returning the first element of the first dimension on all processes (with split=None), so the example above will work on branch bug/754-getitem-indexing. The PR is still a draft because there's a ton of other operations and tests that need to be adapted.

@fschlimb please do let me know if you have more examples that don't work on the indexing branch, it would be great to have this issue settled. Thanks a lot!
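
A minimal sketch of the behavior described above on the bug/754-getitem-indexing branch (expected output shown as comments, based on the description here rather than verified):

import heat as ht

a = ht.ones(128, split=0)
b = a[0]              # single element, now returned on every process
print(b.split)        # None
print((a + b).shape)  # (128,) -- the example from this issue composes again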

@ClaudiaComito ClaudiaComito self-assigned this May 10, 2021
@fschlimb
Contributor Author

@ClaudiaComito Would it be possible to do the data-replication for 0d-tensors on-demand/lazily?

@ClaudiaComito
Contributor

@fschlimb I've been racking my brain about this (@Markus-Goetz this is what I wanted to chat about this morning). It is very inconvenient to broadcast the data back and forth every time we slice an element. But operations on DNDarrays with split=None expect to have the data replicated on each process.

We could add a DNDarray property indicating which process contains the data, and the first operation on that element would trigger the broadcast. I guess that's what you mean by "lazily"?

So, something like:

>>> a[0].split
None
>>> a[0].lazy
0

(no data on ranks > 0)

or

>>> a[0].split
None
>>> a[0].lazy
None

after the first operation has triggered the Bcast.

Thoughts? @coquelin77 @Markus-Goetz, also @ben-bou
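
Purely as an illustration of the idea (not Heat API; the lazy attribute is hypothetical), an operation could resolve the marker like this before computing:

from mpi4py import MPI

def resolve_lazy(x, comm=MPI.COMM_WORLD):
    # x.lazy would hold the owning rank, or None once the data is replicated
    if getattr(x, "lazy", None) is not None:
        x.larray = comm.bcast(x.larray, root=x.lazy)  # replicate from the owner
        x.lazy = None   # from here on it behaves like split=None
    return x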

@fschlimb
Contributor Author

> @fschlimb I've been racking my brain about this (@Markus-Goetz this is what I wanted to chat about this morning). It is very inconvenient to broadcast the data back and forth every time we slice an element. But operations on DNDarrays with split=None expect to have the data replicated on each process.
>
> We could add a DNDarray property indicating which process contains the data, and the first operation on that element would trigger the broadcast. I guess that's what you mean by "lazily"?

@ClaudiaComito Yes, that's what I meant.

> So, something like:
>
> >>> a[0].split
> None
> >>> a[0].lazy
> 0
>
> (no data on ranks > 0)
>
> or
>
> >>> a[0].split
> None
> >>> a[0].lazy
> None
>
> after the first operation has triggered the Bcast.
>
> Thoughts? @coquelin77 @Markus-Goetz, also @ben-bou

Not sure I follow the examples. Could you please elaborate?

I think in my example above the 0d array should be

  • either treated by broadcasting operations (e.g. a 0d tensor can be broadcasted to basically any other tensor shape)
  • or, when used in a tensor operation, auto-converted into a scalar (e.g. replicated).

I am currently working on an experiment which would be based on the latter mechanism (converting to a scalar). For that, however, I need to know which rank owns the value. I know this is not necessarily a common use case, but maybe it would be good to have both the automatic broadcast or conversion and the ability to retrieve the owner of the value.
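
For reference, the first option is just NumPy's standard broadcasting of 0d values (plain NumPy here, only to illustrate the rule):

import numpy as np

x = np.ones((4, 3))
s = np.float64(2.0)   # 0-d value
print((x + s).shape)  # (4, 3): a 0-d operand broadcasts to any shape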

@ben-bou
Collaborator

ben-bou commented May 11, 2021

I don't understand how delaying the Bcast ("replication") until the first evaluation would be advantageous. I think this would only be the case if there is no replication to all processes at all, but rather the empty processes can "request" the data whenever they need it. How would this be implemented? Would the root process have to spawn a "listener" to answer the requests? How can the root process guarantee that the data has not been changed locally?

Alternatively, and I guess this is what is meant here, this case could be seen as an extremely unbalanced DNDarray, and then be redistributed when needed (most notably in __setitem__ and __binary_op__). I've done some very rudimentary work on such a redistribution, but couldn't figure out how to do it in the case that the split axis doesn't exist anymore (apart from simple replication, see above). This is then related to the discussion around #754, where we found that forcing to keep the split axis is not feasible and decided in favor of the "eager" replication.

The encountered problems could probably be circumvented by introducing the lazy attribute. However, I would propose to make this non-distributed DNDarray a subclass of DNDarray (or possibly even torch.Tensor), see the sketch after this list, because

  1. it is not really a distributed DNDarray, e.g. the concepts of lshapes and larrays on the empty processes just don't make sense (we had this in the discussion around #754, "Wrong shape after indexing a single element with a slice"), but rather just a torch.Tensor with the additional attribute of the owner_rank. Therefore it supports every torch.Tensor operation by prefacing it with if self.rank == owner_rank, and every DNDarray operation by additionally using comm = MPI_SELF and split = None.
  2. it behaves differently, e.g.
    • a redistribute is essentially a scatter
    • in __setitem__ the values need to be gathered
    • a binary operation with two of them does not require both to be distributed or replicated, but just moved to the same rank
    • an operation with an unsplit DNDarray can result in a non-distributed DNDarray, but an operation with a split DNDarray should always result in a DNDarray (i.e. trigger a replication).
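
A rough sketch of what such a subclass could look like (names and details are made up to illustrate the proposal; this is not existing Heat API):

from mpi4py import MPI
import torch

class NonDistributedArray:
    """A result that lives on exactly one rank and knows its owner."""
    def __init__(self, local_value, owner_rank, comm=MPI.COMM_WORLD):
        self.owner_rank = owner_rank
        self.comm = comm
        # real data only on the owner, a placeholder everywhere else
        self.larray = torch.as_tensor(local_value) if comm.rank == owner_rank else None

    def __add__(self, other):
        # binary op between two such objects: move `other` to self's owner
        # rank instead of replicating either operand
        if self.comm.rank == other.owner_rank and other.owner_rank != self.owner_rank:
            self.comm.send(other.larray, dest=self.owner_rank, tag=0)
        if self.comm.rank == self.owner_rank:
            rhs = (other.larray if other.owner_rank == self.owner_rank
                   else self.comm.recv(source=other.owner_rank, tag=0))
            return NonDistributedArray(self.larray + rhs, self.owner_rank, self.comm)
        return NonDistributedArray(None, self.owner_rank, self.comm)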

@ClaudiaComito
Contributor

ClaudiaComito commented May 11, 2021

> Not sure I follow the examples. Could you please elaborate?
>
> I think in my example above the 0d array should be
>
> * either treated by broadcasting operations (e.g. a 0d tensor can be broadcasted to basically any other tensor shape)
>
> * or, when used in a tensor operation, auto-converted into a scalar (e.g. replicated).

Hi @fschlimb, I meant the MPI Bcast, sorry for the confusion. I think we're mixing things up a bit, so let me take one step back.

The metadata of a DNDarray have to be consistent with its global shape and local shapes of the process-local tensors, among other things. If I slice one element off along the split axis, I also "squeeze" that axis so I lose the split dimension:

>>> a = ht.ones((5, 6), split=0)
>>> a.split
0
>>> a.lshape
[0/2] (3, 6)
[1/2] (2, 6)
>>> a[0].lshape  # this step involves MPI_Bcast from rank 0!
[0/2]  (6,)
[1/2]  (6,)
>>> a[0].split
None

The communication step (MPI_Bcast) is what I understood you'd like to have a lazy implementation of. Are we on the same page up to here?

Your example above works under #758 because the data (a[0]) have been communicated to all processes.

@fschlimb
Contributor Author

@ben-bou Yes, the replication might not be needed. It's a common case when developers use explicit loops and explicit array indexing.

Imagine a loop like this:

def pairwise_distance(X):
    M = X.shape[0]
    N = X.shape[1]
    D = ht.empty((M, M), dtype=ht.float64, split=0)
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]   # X[i, k] is a 0d DNDarray
                d = d + tmp * tmp
            D[i, j] = ht.sqrt(d)
    return D

Any trivial split will require that, over time, all data points exist on all ranks. At any given point in time, however, a full replication of all concurrently used 0d arrays is not desired.

Of course I do not expect HeAT to perform well on this. I am trying to experiment with an optimizer to address such issues.

Conceptually, there is no need to have a "listener" since every use of a DNDarray is 'collective', so in theory the owner and the consumer both know that the data is needed by the consumer.

@fschlimb
Contributor Author

> > Not sure I follow the examples. Could you please elaborate?
> > I think in my example above the 0d array should be
> >
> > * either treated by broadcasting operations (e.g. a 0d tensor can be broadcasted to basically any other tensor shape)
> >
> > * or, when used in a tensor operation, auto-converted into a scalar (e.g. replicated).
>
> Hi @fschlimb, I meant the MPI Bcast, sorry for the confusion.

@ClaudiaComito Yes I know.

> I think we're mixing things up a bit, so let me take one step back.

Yep, sorry, I mentioned the broadcasting semantics only to clarify that depending on how the 0d array is used in a specific situation different implementations might be considered. One of the options would be to physically expand the 0d tensor according to numpy's broadcasting rules. MPI_bcast is another and send/recv a third.
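
The send/recv variant would just move the value to the rank that actually needs it, e.g. with plain mpi4py (the ranks here are purely illustrative):

from mpi4py import MPI

comm = MPI.COMM_WORLD
owner, consumer = 0, 1   # hypothetical ranks
if comm.rank == owner:
    comm.send(3.14, dest=consumer, tag=0)
elif comm.rank == consumer:
    value = comm.recv(source=owner, tag=0)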

> The metadata of a DNDarray have to be consistent with its global shape and local shapes of the process-local tensors, among other things. If I slice one element off along the split axis, I also "squeeze" that axis so I lose the split dimension:
>
> >>> a = ht.ones((5, 6), split=0)
> >>> a.split
> 0
> >>> a.lshape
> [0/2] (3, 6)
> [1/2] (2, 6)
> >>> a[0].lshape  # this step involves MPI_Bcast from rank 0!
> [0/2]  (6,)
> [1/2]  (6,)
> >>> a[0].split
> None
>
> The communication step (MPI_Bcast) is what I understood you'd like to have a lazy implementation of. Are we on the same page up to here?

Maybe in the general case. Thanks for the explanation, I now better see the requirements on generality and consistency within HeAT. Right now I am really focused on 0d tensors (scalars). Let me continue my dry-prototyping (see above) so I better understand how my ask would fit into HeAT overall. I'll get back to you when I can formulate my request more clearly.

Can you suggest a good way to determine the 'home' of the value in a 0d tensor (if not replicated, e.g. on main branch)?

> Your example above works under #758 because the data (a[0]) have been communicated to all processes.

Nice!

@ClaudiaComito
Contributor

ClaudiaComito commented May 11, 2021

@fschlimb Thanks, I also understand your problem better now.

> Can you suggest a good way to determine the 'home' of the value in a 0d tensor (if not replicated, e.g. on main branch)?

a[0].create_lshape_map() will return a torch tensor of shape (MPI_WORLD.size, a[0].ndim) with the local shapes on each rank. Note that you'll get inconsistent lshapes when the local tensors are empty on the main branch. But for your purpose it should be fine.

@fschlimb
Contributor Author

> @fschlimb Thanks, I also understand your problem better now.
>
> > Can you suggest a good way to determine the 'home' of the value in a 0d tensor (if not replicated, e.g. on main branch)?
>
> a[0].create_lshape_map() will return a torch tensor of shape (MPI_WORLD.size, a[0].ndim) with the local shapes on each rank. Note that you'll get inconsistent lshapes when the local tensors are empty on the main branch. But for your purpose it should be fine.

Yes, I had tried that, what I get is this (on all ranks):

tensor([], size=(4, 0), dtype=torch.int32)

How does this tell me where the value is?

@ClaudiaComito
Contributor

> Yes, I had tried that, what I get is this (on all ranks):
>
> tensor([], size=(4, 0), dtype=torch.int32)

Ouch. Just out of curiosity, could you try the same on this branch? Just trying to figure out if it's a problem I fixed already, or a new one.

Thanks a lot for bringing this topic up!

@fschlimb
Contributor Author

> Ouch. Just out of curiosity, could you try the same on this branch? Just trying to figure out if it's a problem I fixed already, or a new one.

Yep, same result.

@ClaudiaComito
Contributor

ClaudiaComito commented May 11, 2021

Actually, create_lshape_map() returns the correct result, as a[0] is a scalar in this specific case, so the lshape is [].

As an example a[0:1].create_lshape_map() returns

tensor([[1],
        [0]], dtype=torch.int32)

In this case you would know that the sliced element is on rank 0, and rank 1 is "empty". But indeed, if you have a scalar, you can't recover the rank via the lshape map; we would have to add this info to the metadata.
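
For a width-1 slice, recovering the owner rank from the lshape map could then look like this (a sketch based on the 1d example from the top of this issue; the bare-scalar case would still need the extra metadata):

import heat as ht
import torch

a = ht.ones(128, split=0)
lmap = a[0:1].create_lshape_map()          # torch tensor of shape (nprocs, 1)
owner = int(torch.nonzero(lmap[:, 0])[0])  # rank whose local slice is non-empty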

For the time being you can calculate it though, check out ht.communication.counts_displs_shape().

@coquelin77
Member

> @ClaudiaComito Would it be possible to do the data-replication for 0d-tensors on-demand/lazily?

I think that this would be possible, but it would require a special bit of trickery. I think that the best way to do it would be to hide this from the user. The easiest would be to wait until it is called. In my head, I'm picturing this as a class which would call the comm.Wait() the first time that the data is requested. But I think that this would require more dedicated development.
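
A bare-bones sketch of that idea with plain mpi4py (the class and its details are made up for illustration, not Heat code):

from mpi4py import MPI
import numpy as np

class LazyScalar:
    """Sketch: start the broadcast right away as a non-blocking Ibcast,
    but only Wait() for it once the value is actually used."""
    def __init__(self, value, root, comm=MPI.COMM_WORLD):
        self._buf = np.empty(1, dtype=np.float64)
        if comm.rank == root:
            self._buf[0] = value
        self._req = comm.Ibcast(self._buf, root=root)
        self._done = False

    def item(self):
        if not self._done:
            self._req.Wait()   # the comm.Wait() happens here, on first access
            self._done = True
        return float(self._buf[0])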

@coquelin77
Member

I have addressed the create_lshape_map() bug on the branch mentioned

@coquelin77
Member

> I have addressed the create_lshape_map() bug on the branch mentioned

I modified the behavior to show a [1] when the DNDarray has a constant on the process. Since we use the [] to indicate an empty process, using the same for constants is bound to lead to confusion.

Also, I think that we should distribute the result if we know that the output is a constant, regardless of the key type for the getitem.

@coquelin77
Member

coquelin77 commented May 11, 2021

@ClaudiaComito I've changed a few things in the getitem function. Now, if there is only a single element as the result, it will bcast it for both ints and slices. However, I did not do this for advanced indexing.

EDIT: after a small amount of testing, it looks like advanced indexing already does what it's supposed to.

@ClaudiaComito ClaudiaComito linked a pull request May 17, 2021 that will close this issue