[ENH] Proper calculation of distances #2454
Conversation
I was wondering what's this year's big project after last year's trees. Looks like a logical continuation. 👍

The code is far from done yet. You can comment on it, but at this moment I wish somebody could take a look at it. @nikicc, which type of sparse matrices (preferably a single one) do I need to support?

@ajdapretnar, yes, it's logical, it's the next most annoying thing in Orange after trees (and before Bayes?).

SciPy's csr_matrix and csc_matrix are the only ones that we currently use, AFAIK. I believe they support normal NumPy indexing; it's just that one is faster for column operations and the other for row operations. A while ago, we decided that we make no guarantees, and whoever prefers one format to the other may change it in place on the table, so feel free to do so if this brings any speed-ups. Also, I suggest that rather than calling NumPy directly — I spotted at least
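For illustration, a minimal sketch (not from the PR) of the CSR/CSC trade-off being discussed: row slicing is cheap on `csr_matrix`, so before column-heavy work one would convert once to `csc_matrix`.

```python
import numpy as np
import scipy.sparse as sp

# A small sparse matrix; csr_matrix favours row access,
# csc_matrix favours column access.
X = sp.csr_matrix(np.array([[1.0, 0.0, 2.0],
                            [0.0, 0.0, 3.0]]))

row = X.getrow(0).toarray()      # cheap on CSR
X_csc = X.tocsc()                # convert once before column-heavy work
col = X_csc.getcol(2).toarray()  # cheap on CSC
```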
Force-pushed from 87f282f to 9d6bdec
@nikicc, what do you mean by not making any guarantees? Do I understand correctly that the new code for calculation of distances needs to support both types of matrices? The thing is that the Python code only covers fitting (that is, computation of means and variances needed for handling missing values and normalization). The actual computation is in Cython. I thus have to write a separate function for each matrix type × distance type × axis. Which is, well, lots of functions (eight per matrix type). :)
IMO: yes. In theory, you could get either CSC or CSR, though you will mostly get a CSR.
I see. But couldn't you always cast it to either CSR or CSC (whichever is faster for a given axis parameter) before passing it to Cython?

I can't get distances.md to render well pretty much anywhere. How can I render it properly? Where will this be shown?

@ajdapretnar, open it in, say, Haroopad. This file is just for core developers. Parts of the text, but not entire derivations, may eventually be included in the documentation.
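The suggested cast could look roughly like this; the helper name is hypothetical, and the point is only that Cython then has to handle a single format per axis.

```python
import numpy as np
import scipy.sparse as sp

def as_preferred_format(x, axis):
    """Hypothetical helper: hand the Cython code a single sparse format,
    CSR for row-wise distances (axis=1), CSC for column-wise (axis=0)."""
    return sp.csr_matrix(x) if axis == 1 else sp.csc_matrix(x)
```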
Fails on Python 3.5 with
It seems unrelated to my PR. @jerneju, you've added this test in #2392. Could you please check how this PR interferes with it? Could the problem be that sklearn's implementation of logistic regression is unstable and doesn't necessarily produce the same model on the same data?
Can we see some performance benchmarks? Like, how many times worse off is distance computation now? 😆
Orange/distance/_distance.pyx
Outdated
```python
for row2 in range(n_rows2 if two_tables else row1):
    d = 0
    for col in range(n_cols):
        if mads[col] == -2:
```
Can we have magic names?
What do you mean?
Orange/distance/_distance.pyx
Outdated
```python
val1, val2 = x1[row1, col], x2[row2, col]
if npy_isnan(val1) and npy_isnan(val2):
    d += dist_missing2[col]
elif mads[col] == -1:
```
Magic + moving out of inner loop should give some performance benefit.
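Naming the magic values could look like the sketch below. The constant names (and my reading of what `-1`/`-2` mean) are guesses for illustration; the PR itself stores the bare sentinels in `mads`.

```python
# Hypothetical named sentinels for the per-column `mads` array;
# the meanings below are my reading of the diff, not the PR's own names.
MAD_DISCRETE = -1  # column is discrete, compared by value equality
MAD_IGNORED = -2   # column is skipped entirely

def column_kind(mad):
    """Map a `mads` entry to a readable column kind."""
    if mad == MAD_IGNORED:
        return "ignored"
    if mad == MAD_DISCRETE:
        return "discrete"
    return "continuous"
```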
Orange/distance/_distance.pyx
Outdated
```python
d += dist_missing[col, ival1]
elif ival1 != ival2:
    d += 1
elif normalize:
```
Can be moved out of all the loops to benefit performance.
I considered it, but I can't without replicating the case above, which handles discrete data. Getting rid of one `if` out of many, at the cost of basically duplicating the function, is not worth it.

But now I see I could do that in `euclidean_cols` and `manhattan_cols`. Not sure I will, though; it's like splitting the function into two.
Modern microprocessors tend to have quite long pipelines so that the branch misprediction delay is between 10 and 20 clock cycles.
You might put any duplications into small helper functions which will be inlined by the compiler.
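Another way to get the branch out of the innermost loop is to partition the columns once, up front. This is a plain-Python illustration (the PR's code is Cython and the names here are mine), showing the per-element `if` replaced by two vectorized passes:

```python
import numpy as np

def manhattan_row_pair(x1, x2, discrete):
    """Illustrative: split columns once, outside the distance loops,
    instead of testing `discrete[col]` per element."""
    disc = np.flatnonzero(discrete)
    cont = np.flatnonzero(~discrete)
    d = np.abs(x1[cont] - x2[cont]).sum()         # continuous part
    d += np.count_nonzero(x1[disc] != x2[disc])   # unit nominal part
    return d

x1 = np.array([1.0, 2.0, 0.0])
x2 = np.array([1.0, 3.5, 1.0])
discrete = np.array([False, False, True])
# continuous contribution 1.5, nominal contribution 1
```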
The speed depends on how and what you measure. It's trivial to be fast if all you do is calculate dot products. If we add the time needed to prepare the data on which you can then just compute the dot product, it gets way slower. This is, however, beside the point, since we could not properly handle missing and discrete data with sklearn's metrics. If needed, we can still use sklearn for continuous data without missing values. Or handle this ourselves, as a special case.

Which reminds me: some usability metrics would also be nice. Like, how much objectively better / more correct models are now possible. From the user's, not the purist's, perspective. 😺
Like this? :)

Before: Silhouette grouped data by discrete attributes and computed distances on the continuous ones. In other words, it was useful for observing silhouettes of classes and clusters, but not for all-discrete or all-numeric data.
After: Silhouette can be used to actually explore data.

Before: Good luck trying to see an MDS or clustering of, say, the heart disease data without ignoring discrete attributes altogether or constructing dubious preprocessing.
After: Orange now supports the basic unsupervised data exploration it should have always supported.

The only purist thing here is adding variances when essentially imputing means. Imputation of numeric data is not a big deal and could have been done before. Properly handling discrete data and missing discrete data is not purism. Plus, I don't worry too much about speed. Computation of distances is probably negligible in comparison with whatever you plan to do with them.
Can you back the purported advantages with more objective numbers and/or figures to pass peer review?
I know. I meant really simple benchmarks like:

```python
>>> X = np.random.random((10000, 300))
>>> from sklearn.metrics import pairwise_distances
>>> %timeit pairwise_distances(X)
2.17 s ± 40.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

If the new implementation took 40 s on such a reasonable dataset, that'd be quite a hit, no?
I hate you.

```python
In [14]: %timeit Euclidean(t)
```

You haven't tried this on my computer to know what time I would get... Or have you? :)

I'll see what I can get by optimizing loops. In the worst case, I can remove the improved handling of missing values, but we definitely need discrete values (picture below; bees shouldn't be hairy and dolphins should). In this case, I can remove all `if`s from loops.
Right. This is what I get streaming zoo through Continuize and computing distances on indicator columns. Dolphins should have hair, but probably so should bees and houseflies and moths. And since hair is made of about the same material as nails, scales, and shells, so might tortoises. Informative or not, the plots above are about 98% equivalent? Can you show that unit nominal distances make another significant real-world difference? 🐱
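The difference being debated shows up already on a single two-valued attribute: a unit nominal distance contributes 1 for a mismatch, while Euclidean distance over the one-hot indicator columns produced by Continuize contributes √2, so continuization inflates discrete mismatches relative to normalized continuous differences. A toy check in plain NumPy (not Orange code):

```python
import numpy as np

# Two rows disagreeing on one 2-valued attribute.
a, b = "red", "blue"

# Unit nominal distance: 1 if the values differ, 0 otherwise.
nominal = float(a != b)

# The same attribute after one-hot continuization.
a_ind = np.array([1.0, 0.0])  # "red"
b_ind = np.array([0.0, 1.0])  # "blue"
euclid = np.linalg.norm(a_ind - b_ind)  # sqrt(2), not 1
```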
Agreed, the transformations might be idempotently auto-applied as necessary. Normalization of continuous features' values needs to be part of any sane distance-computation pipeline anyway, right?
@janezd
Codecov Report

```
@@            Coverage Diff             @@
##           master    #2454      +/-   ##
==========================================
+ Coverage   74.68%   75.34%   +0.66%
==========================================
  Files         320      327       +7
  Lines       56444    57595    +1151
==========================================
+ Hits        42157    43397    +1240
+ Misses      14287    14198     -89
```
How?! (The fact that I had to ask this already shows we're doing something wrong. :)) I reached this:

I still have some room for improvement, but this should already suffice. The current implementation normalizes the data, computes distances like sklearn does (I actually copied/adapted its code) and then fixes the NaNs due to missing data. For Manhattan (not yet in the commit I just pushed), the new implementation is about twice as fast as sklearn's.

Euclidean has one multiplication per row × row × column, while Manhattan has a subtraction and a fabs. I'll shave some time off here, too.
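The sklearn-style Euclidean computation mentioned above boils down to the identity ‖u − v‖² = ‖u‖² + ‖v‖² − 2 u·v, which turns the row × row × column loop into a single matrix product. A minimal NumPy sketch of that idea (not the PR's Cython code):

```python
import numpy as np

def euclidean_pairwise(X):
    """All-pairs Euclidean distances via the dot-product identity."""
    sq = np.einsum("ij,ij->i", X, X)              # per-row squared norms
    d2 = sq[:, None] + sq[None, :] - 2 * (X @ X.T)
    np.maximum(d2, 0, out=d2)                     # clip tiny negatives
    return np.sqrt(d2)

X = np.array([[0.0, 0.0],
              [3.0, 4.0]])
# the two rows form a 3-4-5 triangle with the origin
```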
Running times (in seconds) on
Orange/distance/__init__.py
Outdated
```python
class Distance:
    def __call__(self, e1, e2=None, axis=1, impute=False):
        # Argument types in docstrings must be in a single line(?), hence
        # pylint: disable=line-too-long
```
In numpydoc, must be in a single logical line.
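For reference, this is what keeping the type union on one logical line looks like in a numpydoc docstring (an illustrative sketch, not the PR's actual docstring):

```python
class Distance:
    def __call__(self, e1, e2=None, axis=1, impute=False):
        """Compute distances between rows or columns.

        Parameters
        ----------
        e1 : Orange.data.Table or Orange.data.RowInstance or numpy.ndarray
            The type union above must stay on one logical line for
            numpydoc to parse it, hence the pylint line-length waiver.
        """
        raise NotImplementedError
```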
Based on #2486 to enable new tests.
Force-pushed from b02bb0e to 18c9387
Issue
Orange currently uses the puny SKL functions for computing distances, which do no normalization and don't handle nominal features or missing values.
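The missing-value gap is easy to demonstrate in plain NumPy: a naive Euclidean distance propagates NaNs, whereas the PR imputes column statistics instead. A crude stand-in for the latter (toy demonstration, not Orange code):

```python
import numpy as np

x1 = np.array([1.0, np.nan, 3.0])
x2 = np.array([1.0, 2.0, 0.0])

# A naive Euclidean distance propagates the missing value.
naive = np.sqrt(np.sum((x1 - x2) ** 2))   # nan

# Crude stand-in for handling missing values: skip the nan terms.
imputed = np.sqrt(np.nansum((x1 - x2) ** 2))
```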
Description of changes
Includes