
[ENH] Replace scikit-learn tSNE with faster implementation #3192

Merged
15 commits merged into biolab:master on Nov 8, 2018

Conversation


pavlin-policar commented Aug 8, 2018

Issue

Scikit-learn's implementation of t-SNE is slow and supports only the Barnes-Hut approximation. Adding new data to an existing embedding is not supported by any existing t-SNE implementation.

Description of changes

Implement wrappers around my implementation of t-SNE, which includes both the Barnes-Hut approximation for smaller data sets and the recently introduced interpolation-based t-SNE, which runs in linear time and is better suited to large data sets.

We can also now add new data points to an existing embedding by running the optimization on those points only, w.r.t. the existing embedding.
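
A minimal sketch of how this looks from the API side, assuming the interface of the underlying library (later published as openTSNE); the exact names in Orange's wrappers may differ:

import numpy as np
from openTSNE import TSNE  # assumed import; package name per the later release

X_train = np.random.randn(500, 10)   # illustrative data
X_new = np.random.randn(20, 10)

tsne = TSNE(random_state=0)
embedding = tsne.fit(X_train)        # optimize the initial embedding

# Optimize the positions of the new points only, keeping the existing
# embedding fixed.
new_points = embedding.transform(X_new)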

Also, I've never packaged anything for PyPI before and I've had a fair number of issues with this, but hopefully it's all OK now.

Includes
  • Code changes
  • Tests
  • Documentation

pavlin-policar changed the title Replace scikit learn tSNE with faster implementation → [WIP] Replace scikit learn tSNE with faster implementation Aug 13, 2018
@pavlin-policar

b646f6f fixes the Manifold Learning widget to work with the new t-SNE wrappers. I've had to remove the "jaccard", "mahalanobis" and "cosine" distances, since sklearn's fast neighbor-search methods don't support them. The previous implementation did support them, but for any reasonably sized data set all pairwise distances had to be computed, which was very slow.
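
For reference, one way to list the metrics scikit-learn's fast tree structures do accept (a sketch; the exact form of valid_metrics varies across scikit-learn versions):

from sklearn.neighbors import KDTree, BallTree

# "cosine", for example, appears in neither list.
print(KDTree.valid_metrics)
print(BallTree.valid_metrics)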


codecov-io commented Sep 12, 2018

Codecov Report

Merging #3192 into master will increase coverage by 0.04%.
The diff coverage is 91.57%.

@@            Coverage Diff             @@
##           master    #3192      +/-   ##
==========================================
+ Coverage   82.21%   82.25%   +0.04%     
==========================================
  Files         351      351              
  Lines       62301    62442     +141     
==========================================
+ Hits        51219    51363     +144     
+ Misses      11082    11079       -3

@@ -99,23 +99,22 @@ def test_singular_matrices(self):
re-introduced, this test is very much required.

"""
# table = Table(
Member

You can @skip this test with the same description and uncomment the code.

Collaborator Author

Oh yes, that is so much nicer.
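
A minimal sketch of the suggested pattern, assuming Python's standard unittest (the class and test body are illustrative, not the actual code):

import unittest

from Orange.data import Table


class TestPreprocess(unittest.TestCase):
    @unittest.skip("If singular matrices are ever re-introduced, "
                   "this test is very much required.")
    def test_singular_matrices(self):
        table = Table("iris")  # the previously commented-out code, restored
        ...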

lanzagar added this to the 3.16 milestone Sep 12, 2018
lanzagar modified the milestones: 3.16, 3.17 Sep 12, 2018
@pavlin-policar

Travis fails for Python 3.4, which we don't support anymore, so this is not an issue. The failure occurs because pynndescent, the library I use for fast approximate nearest-neighbor search, depends on numba, which does not support anything below Python 3.5. If depending on numba is problematic, this can be switched to a different ANN library later on.

Another issue with numba is that it does not support the parallel=True directive on 32-bit systems, which we currently support. I've worked around this in the tSNE library by monkey-patching (here) the @numba.njit decorator, which pynndescent uses extensively. My fix removes all numba acceleration from pynndescent, which really shouldn't be an issue: we can still fall back on exact search using sklearn's trees, and even without numba, pynndescent is still asymptotically faster than exact search.
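
A minimal sketch of what such a patch can look like (illustrative, not the exact code from the tSNE library). Note that the patch has to run before pynndescent is imported, because its functions are compiled by numba at import time:

import sys

import numba

def _njit_noop(*args, **kwargs):
    """A no-op stand-in for numba.njit: return functions uncompiled."""
    if len(args) == 1 and callable(args[0]) and not kwargs:
        return args[0]   # used as a bare @njit
    return lambda f: f   # used as @njit(...) with arguments

if sys.maxsize <= 2 ** 32:  # 32-bit interpreter: parallel=True is unsupported
    numba.njit = _njit_noop

import pynndescent  # imported only after patching, on purpose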

In the event we add anything relying on numba (e.g. UMAP, which is sometimes nicer and usually faster than even this implementation, though asymptotically both are linear in the number of points), both of these issues are certainly something to be aware of.

pavlin-policar changed the title [WIP] Replace scikit learn tSNE with faster implementation → Replace scikit learn tSNE with faster implementation Sep 12, 2018

kernc commented Sep 12, 2018

> I've worked around this in the tSNE library by monkey patching (here) the @numba.njit decorator

Only for Python 2.7?


pavlin-policar commented Sep 12, 2018

uns1 = sys.platform.startswith('win32') and sys.version_info[:2] == (2, 7)

checks for Python 2.7 on Windows, and

uns2 = sys.maxsize <= 2 ** 32

checks whether the system is a 32-bit one. These are the two cases in which numba doesn't support the parallel directive. I took this directly from numba, so I really only patch when necessary. I even kept the variable names to make this (somewhat) obvious.


kernc commented Sep 12, 2018

For some reason, I read uns1 and uns2. Nevermind. 😳


kernc commented Sep 12, 2018

If that's any better, you could patch out only the parallel=True argument:

__njit_copy = numba.njit  # keep a reference to the original decorator

def __njit_wrapper(*args, parallel=True, **kwargs):
    """Discards the `parallel` argument"""
    return __njit_copy(*args, **kwargs)

numba.njit = __njit_wrapper  # install the wrapper in place of njit


pavlin-policar commented Sep 12, 2018

Yeah, I did try to be clever like that, but then other problems popped up. Apparently, numba's compiler expects int64s and is surprised when that's not the default on 32-bit systems. I didn't check whether this was baked into numba or just pynndescent, but it would have been really hard to patch up.

It ended up being a lot simpler to just remove the numba acceleration. Who uses 32-bit systems anyway? 😄

@pavlin-policar

So everything is in place; the only thing I'm still waiting on is getting the conda package properly set up. After that, this should be good to merge. Unfortunately, for the time being, conda-forge seems a bit broken.

conda-forge/staged-recipes#6659

@pavlin-policar

I've now managed to get the package onto conda-forge as well and have gotten the hang of all this. Currently, I put the requirement into both requirements-core.txt and conda-recipe/meta.yaml. Is this necessary? Should I only put it into the conda recipe?

@markotoplak

> Should I only put it into the conda recipe?

Anaconda Python is just one Python distribution. Orange should work with just setup.py (or pip, but without conda) too.

@pavlin-policar

I haven't really been able to figure out why the test was failing on Windows, so I decided to drop sparse support for t-SNE.

IMO, there is actually a valid argument for this. Sparse data usually indicates high-dimensional data, and it is known that t-SNE doesn't scale well with the ambient dimension. High-dimensional input usually leads not only to poor visualizations but also to much longer runtimes. It is standard (and good) practice to reduce the dimension of the data via feature selection and/or PCA before embedding it with t-SNE. Running any kind of data through PCA produces dense output, so it is not the worst thing in the world to drop sparse support here.

This way, we make it impossible to use t-SNE in a way we know to be bad, and we encourage users to use it in a way that will produce better visualizations.
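
A short sketch of that recommended preprocessing with scikit-learn (the data and the choice of 50 components are illustrative; TruncatedSVD stands in for PCA here because it accepts sparse input directly):

from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Illustrative sparse, high-dimensional input.
X = sparse.random(1000, 10000, density=0.01, format="csr", random_state=0)

# Reduce to a dense, low-dimensional representation before running t-SNE.
X_dense = TruncatedSVD(n_components=50, random_state=0).fit_transform(X)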

lanzagar changed the title Replace scikit learn tSNE with faster implementation → [ENH] Replace scikit-learn tSNE with faster implementation Nov 7, 2018
lanzagar merged commit ee460d5 into biolab:master Nov 8, 2018