[ENH] Proper calculation of distances #2454
Conversation
I was wondering what's this year's big project after last year's trees. Looks like a logical continuation. 👍

The code is far from done yet. You can comment on it, but at this moment I wish somebody could take a look at it. @nikicc, which type of sparse matrices (preferably a single one) do I need to support?

@ajdapretnar, yes, it's logical, it's the next most annoying thing in Orange after trees (and before Bayes?).

SciPy's csr_matrix and csc_matrix are the only ones that we currently use, AFAIK. I believe they support normal NumPy indexing; it's just that one is faster for column operations and the other for row operations. A while ago, we decided that we make no guarantees, and whoever prefers one format to the other may change it in place on the table, so feel free to do so if this brings any speed-ups. Also, I suggest that rather than calling NumPy directly — I spotted at least
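For illustration, a minimal sketch (not from the PR) of the CSR/CSC trade-off being discussed: row slicing is cheap on `csr_matrix`, so before column-heavy work one would convert once to `csc_matrix`.

```python
import numpy as np
import scipy.sparse as sp

# A small sparse matrix; csr_matrix favours row access,
# csc_matrix favours column access.
X = sp.csr_matrix(np.array([[1.0, 0.0, 2.0],
                            [0.0, 0.0, 3.0]]))

row = X.getrow(0).toarray()      # cheap on CSR
X_csc = X.tocsc()                # convert once before column-heavy work
col = X_csc.getcol(2).toarray()  # cheap on CSC
```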
Force-pushed from 87f282f to 9d6bdec
@nikicc, what do you mean by not making any guarantees? Do I understand correctly that the new code for calculation of distances needs to support both types of matrices? The thing is that the Python code only covers fitting (that is, computation of means and variances needed for handling missing values and normalization). The actual computation is in Cython. I thus have to write a separate function for each matrix type × distance type × axis. Which is, well, lots of functions (eight per matrix type). :)
IMO: yes. In theory, you could get either CSC or CSR, though you will mostly get a CSR.
I see. But couldn't you always cast it to either CSR or CSC (whichever is faster for a given axis parameter) before passing it to Cython?

I can't get distances.md to render well pretty much anywhere. How can I render it properly? Where will this be shown?

@ajdapretnar, open it in, say, Haroopad. This file is just for core developers. Parts of the text, but not entire derivations, may eventually be included in the documentation.
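The suggested cast could look roughly like this; the helper name is hypothetical, and the point is only that Cython then has to handle a single format per axis.

```python
import numpy as np
import scipy.sparse as sp

def as_preferred_format(x, axis):
    """Hypothetical helper: hand the Cython code a single sparse format,
    CSR for row-wise distances (axis=1), CSC for column-wise (axis=0)."""
    return sp.csr_matrix(x) if axis == 1 else sp.csc_matrix(x)
```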
Fails on Python 3.5 with
It seems unrelated to my PR. @jerneju, you've added this test in #2392. Could you please check how this PR interferes with it? Could the problem be that sklearn's implementation of logistic regression is unstable and doesn't necessarily produce the same model on the same data?
Can we see some performance benchmarks? Like, how many times worse off is distance computation now? 😆
Orange/distance/_distance.pyx
Outdated
```python
for row2 in range(n_rows2 if two_tables else row1):
    d = 0
    for col in range(n_cols):
        if mads[col] == -2:
```
Can we have magic names?
What do you mean?
Orange/distance/_distance.pyx
Outdated
```python
val1, val2 = x1[row1, col], x2[row2, col]
if npy_isnan(val1) and npy_isnan(val2):
    d += dist_missing2[col]
elif mads[col] == -1:
```
Magic + moving out of inner loop should give some performance benefit.
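Naming the magic values could look like the sketch below. The constant names (and my reading of what `-1`/`-2` mean) are guesses for illustration; the PR itself stores the bare sentinels in `mads`.

```python
# Hypothetical named sentinels for the per-column `mads` array;
# the meanings below are my reading of the diff, not the PR's own names.
MAD_DISCRETE = -1  # column is discrete, compared by value equality
MAD_IGNORED = -2   # column is skipped entirely

def column_kind(mad):
    """Map a `mads` entry to a readable column kind."""
    if mad == MAD_IGNORED:
        return "ignored"
    if mad == MAD_DISCRETE:
        return "discrete"
    return "continuous"
```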
Orange/distance/_distance.pyx
Outdated
```python
d += dist_missing[col, ival1]
elif ival1 != ival2:
    d += 1
elif normalize:
```
Can be moved out of all the loops to benefit performance.
I considered it, but I can't without replicating the case above, which handles discrete data. Getting rid of one `if` out of many, at the cost of basically duplicating the function, is not worth it.

But now I see I could do that in `euclidean_cols` and `manhattan_cols`. Not sure I will, though; it's like splitting the function into two.
Modern microprocessors tend to have quite long pipelines so that the branch misprediction delay is between 10 and 20 clock cycles.
You might put any duplications into small helper functions which will be inlined by the compiler.
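Another way to get the branch out of the innermost loop is to partition the columns once, up front. This is a plain-Python illustration (the PR's code is Cython and the names here are mine), showing the per-element `if` replaced by two vectorized passes:

```python
import numpy as np

def manhattan_row_pair(x1, x2, discrete):
    """Illustrative: split columns once, outside the distance loops,
    instead of testing `discrete[col]` per element."""
    disc = np.flatnonzero(discrete)
    cont = np.flatnonzero(~discrete)
    d = np.abs(x1[cont] - x2[cont]).sum()         # continuous part
    d += np.count_nonzero(x1[disc] != x2[disc])   # unit nominal part
    return d

x1 = np.array([1.0, 2.0, 0.0])
x2 = np.array([1.0, 3.5, 1.0])
discrete = np.array([False, False, True])
# continuous contribution 1.5, nominal contribution 1
```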
The speed depends on how and what you measure. It's trivial to be fast if all you do is calculate dot products. If we add the time needed to prepare the data on which you can then just compute the dot product, it gets way slower. This is, however, beside the point, since we could not properly handle missing and discrete data with sklearn's metrics. If needed, we can still use sklearn for continuous data without missing values. Or handle this ourselves, as a special case.

Which reminds me: some usability metrics would also be nice. Like, how much objectively better / more correct models are now possible. From the user's, not the purist's, perspective. 😺
Like this? :)

Before: Silhouette grouped data by discrete attributes and computed distances on the continuous ones. In other words, it was useful for observing silhouettes of classes and clusters, but not for all-discrete or all-numeric data.
After: Silhouette can be used to actually explore data.

Before: Good luck trying to see an MDS or clustering of, say, the heart disease data without ignoring discrete attributes altogether or constructing dubious preprocessing.
After: Orange now supports the basic unsupervised data exploration it should have always supported.

The only purist thing here is adding variances when essentially imputing means. Imputation of numeric data is not a big deal and could have been done before. Properly handling discrete data and missing discrete data is not purism. Plus, I don't worry too much about speed. Computation of distances is probably negligible in comparison with whatever you plan to do with them.
Can you back the purported advantages with more objective numbers and/or figures to pass peer review?
I know. I meant really simple benchmarks like:

```python
>>> X = np.random.random((10000, 300))
>>> from sklearn.metrics import pairwise_distances
>>> %timeit pairwise_distances(X)
2.17 s ± 40.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

If the new implementation took 40 s on such a reasonable dataset, that'd be quite a hit, no?
I hate you.

```python
In [14]: %timeit Euclidean(t)
```

You haven't tried this on my computer to know what time I would get... Or have you? :)

I'll see what I can get by optimizing loops. In the worst case, I can remove the improved handling of missing values, but we definitely need discrete values (picture below; bees shouldn't be hairy and dolphins should). In this case, I can remove all `if`s from loops.
Right. This is what I get streaming zoo through Continuize and computing distances on indicator columns. Dolphins should have hair, but probably so should bees and houseflies and moths. And since hair is made of about the same material as nails, scales, and shells, so might tortoises. Informative or not, the plots above are about 98% equivalent? Can you show that unit nominal distances make another significant real-world difference? 🐱
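The difference being debated shows up already on a single two-valued attribute: a unit nominal distance contributes 1 for a mismatch, while Euclidean distance over the one-hot indicator columns produced by Continuize contributes √2, so continuization inflates discrete mismatches relative to normalized continuous differences. A toy check in plain NumPy (not Orange code):

```python
import numpy as np

# Two rows disagreeing on one 2-valued attribute.
a, b = "red", "blue"

# Unit nominal distance: 1 if the values differ, 0 otherwise.
nominal = float(a != b)

# The same attribute after one-hot continuization.
a_ind = np.array([1.0, 0.0])  # "red"
b_ind = np.array([0.0, 1.0])  # "blue"
euclid = np.linalg.norm(a_ind - b_ind)  # sqrt(2), not 1
```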
Agreed, the transformations might be idempotently auto-applied as necessary. Normalization of continuous features' values needs to be part of any sane distance-computation pipeline anyway, right?
@janezd
Codecov Report

```
@@            Coverage Diff             @@
##           master    #2454      +/-   ##
==========================================
+ Coverage   74.68%   75.34%   +0.66%
==========================================
  Files         320      327       +7
  Lines       56444    57595    +1151
==========================================
+ Hits        42157    43397    +1240
+ Misses      14287    14198     -89
```
How?! (The fact that I had to ask this already shows we're doing something wrong. :)) I reached this:

I still have some room for improvement, but this should already suffice. The current implementation normalizes the data, computes distances like sklearn does (I actually copied/adapted its code) and then fixes the NaNs due to missing data. For Manhattan (not yet in the commit I just pushed), the new implementation is about twice as fast as sklearn's.

Euclidean has one multiplication per row × row × column, while Manhattan has a subtraction and a fabs. I'll shave some time off here, too.
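The sklearn-style Euclidean computation mentioned above boils down to the identity ‖u − v‖² = ‖u‖² + ‖v‖² − 2 u·v, which turns the row × row × column loop into a single matrix product. A minimal NumPy sketch of that idea (not the PR's Cython code):

```python
import numpy as np

def euclidean_pairwise(X):
    """All-pairs Euclidean distances via the dot-product identity."""
    sq = np.einsum("ij,ij->i", X, X)              # per-row squared norms
    d2 = sq[:, None] + sq[None, :] - 2 * (X @ X.T)
    np.maximum(d2, 0, out=d2)                     # clip tiny negatives
    return np.sqrt(d2)

X = np.array([[0.0, 0.0],
              [3.0, 4.0]])
# the two rows form a 3-4-5 triangle with the origin
```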
Running times (in seconds) on
Orange/distance/__init__.py
Outdated
```python
class Distance:
    def __call__(self, e1, e2=None, axis=1, impute=False):
        # Argument types in docstrings must be in a single line(?), hence
        # pylint: disable=line-too-long
```
In numpydoc, must be in a single logical line.
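For reference, this is what keeping the type union on one logical line looks like in a numpydoc docstring (an illustrative sketch, not the PR's actual docstring):

```python
class Distance:
    def __call__(self, e1, e2=None, axis=1, impute=False):
        """Compute distances between rows or columns.

        Parameters
        ----------
        e1 : Orange.data.Table or Orange.data.RowInstance or numpy.ndarray
            The type union above must stay on one logical line for
            numpydoc to parse it, hence the pylint line-length waiver.
        """
        raise NotImplementedError
```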
Based on #2486 to enable new tests.
Force-pushed from b02bb0e to 18c9387
Issue
Orange currently uses the puny SKL functions for computing distances, which do no normalization and don't handle nominal features or missing values.
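The missing-value gap is easy to demonstrate in plain NumPy: a naive Euclidean distance propagates NaNs, whereas the PR imputes column statistics instead. A crude stand-in for the latter (toy demonstration, not Orange code):

```python
import numpy as np

x1 = np.array([1.0, np.nan, 3.0])
x2 = np.array([1.0, 2.0, 0.0])

# A naive Euclidean distance propagates the missing value.
naive = np.sqrt(np.sum((x1 - x2) ** 2))   # nan

# Crude stand-in for handling missing values: skip the nan terms.
imputed = np.sqrt(np.nansum((x1 - x2) ** 2))
```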
Description of changes
Includes