
Implement the side keyword argument as in numpy #1

Open
baldassarreFe opened this issue Oct 16, 2019 · 12 comments

@baldassarreFe
Contributor
The numpy version of searchsorted takes an additional side parameter that specifies the semantics of the returned index:

side   returned index i satisfies
left   a[i-1] < v <= a[i]
right  a[i-1] <= v < a[i]

The current implementation of searchsorted for PyTorch follows the left behavior. Would it be possible to also implement the right version, to allow ordering from the right?
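For reference, numpy's documented behavior on a small example, showing how the two sides differ only when the value is already in the array:

>>> import numpy as np
>>> a = np.array([1, 2, 3, 3, 5])
>>> np.searchsorted(a, 3, side='left')   # 2, since a[1] < 3 <= a[2]
2
>>> np.searchsorted(a, 3, side='right')  # 4, since a[3] <= 3 < a[4]
4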

@aliutkus
Owner

Hi, could you check out the latest version of master?

@baldassarreFe
Contributor Author

Hi, just checked.

Here are a few comments about the code:

In the pytest test here, why do we have separate Ba, Bv and then skip the test if Ba != Bv? Couldn't the test iterate over a single value B for the batch dimension? Like:

from itertools import product

import pytest
import torch

B_val = [1, 100, 200]
A_val = [1, 50, 500]
V_val = [1, 12, 120]
side_val = ['left', 'right']
nrepeat = 100

# `device` is the existing pytest fixture (e.g. 'cpu')
@pytest.mark.parametrize('B,A,V,side', product(B_val, A_val, V_val, side_val))
def test_searchsorted_correct(B, A, V, side, device):
    for test in range(nrepeat):
        a = torch.sort(torch.rand(B, A, device=device), dim=1)[0]
        v = torch.rand(B, V, device=device)
        ...

I see the function numpy_searchsorted was moved to the main source folder, but it was meant as a test-only utility. That's why it was in test/utils.py and not in src: it was available to pytest at test time, but not anywhere else. Since this repo is a PyTorch implementation of searchsorted, I don't think the numpy version should be part of the main code; in particular, it shouldn't be exposed in src/torchsearchsorted/__init__.py, as it could confuse users.

For checking the results in examples/test.py we should use torch.testing.assert_allclose instead of the norm, also because, as you noticed, torch.norm expects float tensors rather than long indexes. Also, is this test still necessary now that we have unit tests in place? Maybe the examples could simply showcase how to use the searchsorted function.
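A rough sketch of what that check could look like, assuming the package exports searchsorted and using a per-row numpy loop as a stand-in for the repo's numpy_searchsorted utility:

import numpy as np
import torch
from torchsearchsorted import searchsorted   # assuming the package exports searchsorted

a = torch.sort(torch.rand(4, 50), dim=1)[0]
v = torch.rand(4, 12)

out = searchsorted(a, v)

# per-row numpy reference, standing in for the repo's numpy_searchsorted utility
ref = np.stack([np.searchsorted(a[i].numpy(), v[i].numpy()) for i in range(a.shape[0])])

# assert_allclose works on long index tensors directly, unlike torch.norm
torch.testing.assert_allclose(out.long(), torch.from_numpy(ref).long())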


Regarding the implementation of side, did you manage to get pytest to succeed? On my machine, it runs some of the tests successfully, then on one test it hangs forever while appearing busy.

If you run pytest -v you'll see it hangs on different test cases, sometimes it's
test/test_searchsorted.py::test_searchsorted_correct[cpu-100-100-50-120-right]
other times it's
test/test_searchsorted.py::test_searchsorted_correct[cpu-200-200-500-120-right]

My guess is that some bug in the code for side=right results in the binary search loop getting stuck forever. I'll have to look into that more carefully though.

@ncourty

ncourty commented Oct 23, 2019

Hello, I am having similar trouble with the side parameter. I am trying to put together minimal code to reproduce the bug, but it seems to happen randomly.

@aliutkus
Owner

aliutkus commented Oct 23, 2019

Thanks for your feedback

concerning this:

why do we have separate Ba, Bv

This is because one of a or v can have 1 as its first dimension while the other's is larger, which was not tested. However, they must not have different first dimensions unless one of them is 1. You left these cases out of your tests, and out of your numpy_searchsorted as well (hence the sel subfunction there now).

I however realize that the test code is indeed wrong, because it should be:

if Ba > 1 and Bv > 1 and Ba != Bv:
    return

This is fixed now.

Concerning the placement of numpy_searchsorted in the main package, I thought it might be useful to several people for testing. Since torch depends on numpy anyway (doesn't it?), I thought it would not be harmful.

Now, concerning the tests:

  • When it fails, as I now mention in the README.md:

    In some cases, the results vary from numpy's version. However, as far as I could see, this only happens when values are equal, which means we actually don't care about the order in which this value is inserted. I decided not to care about this either.

    I investigated this, and it happens when the value to insert is equal to an entry of the array. The result is still valid, just different from numpy's. Do we really care in that case?

  • When it hangs forever: it's true, I observe it as well, I will investigate.

What do you think?

@aliutkus
Owner

OK, I found the error behind the endless loop: it was happening when the last element of a was equal to the value to insert. This should be fixed. Now investigating whether it's worth/easy to fix the discrepancies with numpy.
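For reference, a bisect_left/bisect_right-style search over a half-open interval cannot get stuck on equal values, whichever side is requested; a minimal Python sketch of that pattern (not the actual CUDA kernel):

def binary_search(a, v, side='left'):
    """Return the insertion index of v in sorted sequence a, mimicking np.searchsorted."""
    lo, hi = 0, len(a)
    while lo < hi:                      # half-open interval [lo, hi), always shrinks
        mid = (lo + hi) // 2
        if a[mid] < v or (side == 'right' and a[mid] == v):
            lo = mid + 1                # v belongs strictly to the right of mid
        else:
            hi = mid                    # v belongs at mid or to its left
    return lo

assert binary_search([1, 2, 3, 3, 5], 3, side='left') == 2
assert binary_search([1, 2, 3, 3, 5], 3, side='right') == 4
assert binary_search([0, 1, 5], 5, side='right') == 3   # value equal to the last element terminates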

@ncourty

ncourty commented Oct 23, 2019

Great! Your last update solved the bug on my side. Thanks for the nice code 👍

@aliutkus
Owner

No problem, sorry for this, I shouldn't have merged into master so early.

@baldassarreFe
Contributor Author

baldassarreFe commented Oct 26, 2019

Well, I argue that we do care about what happens in case the value to insert is already present in the array. The whole point of the side parameter is to decide where to insert that value in case of this ambiguity. Let me first show you some examples and then I'll explain my use case:

If the value is not present in the array, both sides give the same answer:

>>> a = np.array([1, 2, 3, 5, 6, 7])
>>> v = 4
>>> np.searchsorted(a, v, side='left')
3
>>> np.searchsorted(a, v, side='right')
3

With side=left the returned index will place the new value to the left of identical values present in the array:

>>> a = np.array([1, 2, 3, 3, 3, 3, 3, 3, 6, 7])
>>> v = 3
>>> np.searchsorted(a, v, side='left')
2

On the contrary, side=right will give an insertion index that places the new value to the right of identical values present in the array:

>>> a = np.array([1, 2, 3, 3, 3, 3, 3, 3, 6, 7])
>>> v = 3
>>> np.searchsorted(a, v, side='right')
8

Now, for my use case. I use searchsorted to bucketize a series of real-valued distances into bins of unequal size, e.g. the distance .3 must end up in the first bin [0, 1), the distance 3.4 in the second bin [1, 5), and so on. The bin indexes are then one-hot encoded and become input features for a model.

The bounds of these buckets are represented like this, with the last bucket being [100, +inf):

buckets = [0, 1, 5, 10, 20, 100]

Therefore, for these input distances:

dist = [0, .3, 4, 4.5, 5, 6.8, 23.4, 123, 401]

I expect this output:

idxs = [0, 0, 1, 1, 2, 2, 4, 5, 5]

Which I can get by doing

>>> np.searchsorted(buckets, dist, side='right') - 1
[0, 0, 1, 1, 2, 2, 4, 5, 5]

Using side='left' instead would give the wrong index for items on the bucket boundaries:

>>> np.searchsorted(buckets, dist, side='left')
[0, 1, 2, 2, 2, 3, 5, 6, 6]
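To make the pipeline concrete, a small numpy sketch of the whole bucketize step described above (the one-hot encoding via np.eye is purely illustrative):

>>> import numpy as np
>>> buckets = np.array([0, 1, 5, 10, 20, 100])
>>> dist = np.array([0, .3, 4, 4.5, 5, 6.8, 23.4, 123, 401])
>>> idxs = np.searchsorted(buckets, dist, side='right') - 1
>>> idxs
array([0, 0, 1, 1, 2, 2, 4, 5, 5])
>>> features = np.eye(len(buckets))[idxs]  # one-hot features, shape (9, 6)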

@baldassarreFe
Contributor Author

Ah, about the broadcasting behavior for the batch dimension: I understand now what that test and the sel function were about. I think it could be written more clearly using numpy's broadcast_arrays and pytorch's broadcast_tensors. Those utility functions take care of the broadcasting for us and are tested in their respective projects.
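A rough numpy sketch of the idea; the broadcast_batch helper below is hypothetical, and np.broadcast_to stands in here because only the batch dimension is shared between a and v:

import numpy as np

def broadcast_batch(a, v):
    # Hypothetical helper: expand a size-1 batch dimension so a and v share
    # the same batch size, replacing the hand-rolled `sel` indexing.
    # Raises if both batch sizes are larger than 1 and different.
    B = max(a.shape[0], v.shape[0])
    return (np.broadcast_to(a, (B,) + a.shape[1:]),
            np.broadcast_to(v, (B,) + v.shape[1:]))

a = np.sort(np.random.rand(1, 50), axis=1)   # batch size 1
v = np.random.rand(8, 12)                    # batch size 8
a, v = broadcast_batch(a, v)                 # both now have batch size 8, without copying data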

I can update the Python code where needed; it should be quite quick. Then sometime next week I'll have a look at the C++ and CUDA parts to address the side logic ;)

@aliutkus
Owner

Ha! Thanks for the explanations! OK, feel free to update the code and make a new PR when ready.

@baldassarreFe
Contributor Author

I have a working version of this in my fork. I rewrote most of the search code, which should now be more readable and possibly faster. Do you want to check it out?

I'll send a PR very soon, just want to write a benchmark script ;)

@aliutkus
Owner

Hi @baldassarreFe

OK, cool, thanks for this!
Please make sure that you start from a fresh version before doing the PR!
Best,

Antoine
