[ENH] Replace scikit-learn tSNE with faster implementation #3192
b646f6f fixes the Manifold Learning widget to work with the new tSNE wrappers. I've had to remove the "jaccard", "mahalanobis" and "cosine" distances, since sklearn's fast neighbor search methods don't support them. The previous implementation did support them, but for any reasonably sized data set all pairwise distances had to be computed, so it was very slow.
Codecov Report
@@ Coverage Diff @@
## master #3192 +/- ##
==========================================
+ Coverage 82.21% 82.25% +0.04%
==========================================
Files 351 351
Lines 62301 62442 +141
==========================================
+ Hits 51219 51363 +144
+ Misses 11082 11079 -3
@@ -99,23 +99,22 @@ def test_singular_matrices(self):
re-introduced, this test is very much required.

"""
# table = Table(
You can `@skip` this test with the same description and uncomment the code.
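A minimal sketch of that suggestion, assuming the standard library's `unittest.skip` decorator is what is meant. The class name, the skip message and the test bodies are placeholders, not the actual Orange test code; only the name `test_singular_matrices` comes from the diff above.

```python
import unittest


class TestExample(unittest.TestCase):  # hypothetical test class
    @unittest.skip("Placeholder reason: re-enable if the feature is re-introduced")
    def test_singular_matrices(self):
        # The real test body can stay uncommented; it is never executed
        # while the skip decorator is present.
        self.fail("never executed while the test is skipped")

    def test_still_runs(self):
        self.assertEqual(1 + 1, 2)


def run_suite():
    """Run the example test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestExample)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Compared with commenting the body out, the skipped test still shows up in the test report together with its reason, so it is harder to forget about.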
Oh yes, that is so much nicer.
Travis fails for Python 3.4, which we don't support anymore, so this is not an issue. The failure occurs because the library I use for fast approximate nearest neighbor search [...]. Another issue with numba is that it does not support the [...]. In the event we add anything relying on numba (e.g. UMAP, which is sometimes nicer and usually faster than even this implementation, though asymptotically both are linear in the number of points), both of these issues are certainly something to be aware of.
Only for Python 2.7?
checks the python version and
checks whether the system is a 32bit one. These are the two cases where numba doesn't support the
For some reason, I read
If that's any better, you could patch out only the

    def __njit_wrapper(*args, parallel=True, **kwargs):
        """Discards `parallel` argument"""
        return __njit_copy(*args, **kwargs)
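The wrapper idea above can be sketched in a self-contained way. This is only an illustration under stated assumptions: `fake_njit` is a hypothetical stand-in for a JIT decorator that rejects `parallel`, not numba's real `njit`, and the names `_njit_copy` and `_njit_wrapper` mirror the snippet above rather than any actual patch in this PR.

```python
import functools


def fake_njit(*args, **kwargs):
    """Hypothetical stand-in for a JIT decorator that rejects `parallel`."""
    if "parallel" in kwargs:
        raise TypeError("parallel is not supported on this platform")
    # Support both bare `@fake_njit` and parametrised `@fake_njit(...)` usage.
    if len(args) == 1 and callable(args[0]) and not kwargs:
        return args[0]
    return lambda func: func


# Keep a reference to the original decorator before wrapping it.
_njit_copy = fake_njit


@functools.wraps(_njit_copy)
def _njit_wrapper(*args, parallel=True, **kwargs):
    """Discards the `parallel` argument and forwards everything else."""
    return _njit_copy(*args, **kwargs)


@_njit_wrapper(parallel=True, fastmath=True)
def add(a, b):
    return a + b
```

In a real patch the wrapper would be assigned over the library's decorator before the dependent module is imported; as the reply below notes, that alone may not be enough on 32-bit systems.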
Yeah, I did try to be clever like that, but then other problems pop up. Apparently, the compiler for numba expects int64s and is surprised when that's not the default on 32-bit systems. I didn't check whether this was baked into numba or just pynndescent, but that would have been really hard to patch up. It ended up being a lot simpler to just remove numba acceleration. Who uses 32-bit systems anyways? 😄
So everything is in place; the only thing I'm still waiting on is the conda package to be properly set up. After that, this should be good to merge. Unfortunately, for the time being, conda-forge seems a bit broken.
I've now managed to get the package on conda-forge as well and have gotten the hang of all this. Currently, I put the requirement both into
Anaconda Python is just one Python distribution. Orange should work with just setup.py (or pip, but without conda) too.
I haven't really been able to figure out why the test was failing on Windows, so I decided to drop sparse support for t-SNE. IMO, there is actually a valid argument for this. Sparse data usually indicates that we have high-dimensional data, and it is known that t-SNE doesn't scale well with ambient dimension. High-dimensional input usually leads not only to poor visualizations, but also to much longer runtime. It is standard (and good) practice to reduce the dimension of the data via feature selection and/or PCA prior to embedding it with t-SNE. Running any kind of data through PCA will produce a dense output, so it is not the worst thing in the world to drop sparse support here. This way, we make it impossible for the user to use t-SNE in a way we know to be bad, and we encourage users to use t-SNE in a way that will produce better visualizations.
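As a rough illustration of that workflow (not the code in this PR), the sketch below reduces a sparse matrix to a dense, low-dimensional one before embedding it. It assumes scikit-learn and SciPy are installed, swaps in `TruncatedSVD` for PCA because it accepts sparse input directly, and uses scikit-learn's `TSNE` purely as a stand-in for the new implementation. The data and all parameter values are arbitrary.

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# Synthetic sparse, high-dimensional input (illustrative data only).
X_sparse = sparse.random(100, 500, density=0.05, format="csr", random_state=0)

# Reduce to a handful of components first. TruncatedSVD is a PCA-like
# reduction that works on sparse matrices and returns a dense NumPy array.
X_reduced = TruncatedSVD(n_components=20, random_state=0).fit_transform(X_sparse)

# Embed the dense, low-dimensional data with t-SNE (stand-in implementation).
embedding = TSNE(
    n_components=2, perplexity=10, init="random", random_state=0
).fit_transform(X_reduced)
```

The t-SNE step only ever sees the dense 100 x 20 matrix, which is why dropping sparse input from the t-SNE wrapper costs little in practice.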
Issue
scikit-learn's implementation of tSNE is slow. It only supports the Barnes-Hut approximation. Adding new data to an existing embedding is not supported by any existing tSNE implementation.
Description of changes
Implement wrappers around my implementation of tSNE, which includes both Barnes-Hut for small data sets and the recently introduced interpolation-based tSNE, which runs in linear time, for larger data sets.
We can also now add new data points to an existing embedding by running optimization on those points only, w.r.t. the existing embedding.
Also, I've never packaged anything for PyPI before, and I've had a fair number of issues with this, but hopefully it's all ok now.
Includes