
We should use cosine similarity, not Euclidean distance, when computing nearest neighbours #148

Open
alexgarel opened this issue Sep 6, 2022 · 5 comments

Comments

@alexgarel
Member

alexgarel commented Sep 6, 2022

What

The embeddings we use generate a metric space based on cosine similarity.
When computing nearest neighbours, we currently use Euclidean distance.
Change this to use cosine similarity instead.
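
As a rough illustration of the requested change (a minimal sketch with scikit-learn and placeholder data, not the project's actual index code), switching the metric is typically a one-line change:

```python
# Sketch only: `embeddings` and `query` are placeholders, not our real logo features.
import numpy as np
from sklearn.neighbors import NearestNeighbors

embeddings = np.random.rand(1000, 512).astype(np.float32)  # placeholder logo embeddings
query = embeddings[:1]

# Current behaviour: Euclidean distance.
knn_euclidean = NearestNeighbors(n_neighbors=5, metric="euclidean").fit(embeddings)

# Proposed behaviour: cosine distance (1 - cosine similarity).
knn_cosine = NearestNeighbors(n_neighbors=5, metric="cosine").fit(embeddings)

print(knn_euclidean.kneighbors(query, return_distance=False))
print(knn_cosine.kneighbors(query, return_distance=False))
```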

@alexgarel
Member Author

Note that during Data For Good, @tcayla ran some tests on this and confirmed that the metric is indeed cosine similarity.

@raphael0202
Collaborator

I don't get what you mean by "the embeddings we use generate a metric space based on cosine similarity". Can you elaborate?

@alexgarel
Member Author

When you generate an embedding, you are generating a space with a learned metric (that is another way of looking at it).
So you should use the right metric when searching in that space.

It is well known that for word2vec the metric is cosine similarity and not Euclidean distance (and the two are quite different, especially if your vectors are not normalized, but even when they are normalized, and the gap widens with the number of dimensions).
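
A toy example (plain NumPy with made-up 2-D vectors, not word2vec itself) showing how the two metrics can disagree on which point is the nearest neighbour when vectors are not normalized:

```python
import numpy as np

query = np.array([1.0, 0.0])
a = np.array([10.0, 0.0])   # same direction as the query, but far away
b = np.array([0.9, 0.9])    # close in Euclidean terms, different direction

def euclidean(u, v):
    return np.linalg.norm(u - v)

def cosine_sim(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(euclidean(query, a), euclidean(query, b))    # b is closer: 9.0 vs ~0.91
print(cosine_sim(query, a), cosine_sim(query, b))  # a is more similar: 1.0 vs ~0.71
```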

I strongly suspected the logo embedding space to also be based on cosine similarity (it seems to me this is the case for most modern embedding algorithms that train a neural network and take the last hidden layer activations). @tcayla did some experiments and told me that it was indeed the case.

Note that cosine similarity has the drawback of not respecting the triangle inequality, so approximate nearest-neighbour methods may fail, but in practice this is not significant (to respect the triangle inequality one would have to normalize the vectors and switch to the angular, i.e. arc-cosine, distance, but that is really not worth it).
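
For what it's worth, the usual workaround for index structures that need a proper metric is to L2-normalize the vectors first: on unit vectors, squared Euclidean distance equals 2 × (1 − cosine similarity), so the Euclidean ranking matches the cosine ranking. A quick NumPy check of that identity (a sketch, not project code):

```python
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.normal(size=512), rng.normal(size=512)
u, v = u / np.linalg.norm(u), v / np.linalg.norm(v)  # L2-normalize

cos_sim = u @ v
sq_euclidean = np.sum((u - v) ** 2)

# ||u - v||^2 = 2 - 2 * cos(u, v) for unit vectors
assert np.isclose(sq_euclidean, 2 * (1 - cos_sim))
```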

@raphael0202
Collaborator

Do you have any references on this matter?
It's true that cosine similarity is the best metric for word2vec embeddings, but I had never heard of a learned metric associated with an embedding space in general. Apart from models that explicitly use a cosine similarity objective (such as some Siamese networks or CLIP, for example), I can't really see why cosine similarity would be more appropriate than Euclidean distance when the model is trained with a negative log-likelihood loss.
In my experience, cosine similarity and Euclidean distance give very similar results for KNN on embeddings obtained from classic computer vision models (ResNet, ...).
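
If someone wants to check this on our logo embeddings, one quick experiment (a sketch assuming a precomputed `embeddings` array; the random data below is only a placeholder) is to compare the top-k neighbour sets under both metrics and report their average overlap:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

embeddings = np.random.rand(500, 512)  # placeholder for real logo features
k = 10

idx_euc = NearestNeighbors(n_neighbors=k + 1, metric="euclidean").fit(embeddings)
idx_cos = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(embeddings)

nn_euc = idx_euc.kneighbors(embeddings, return_distance=False)[:, 1:]  # drop self
nn_cos = idx_cos.kneighbors(embeddings, return_distance=False)[:, 1:]

overlap = np.mean([len(set(e) & set(c)) / k for e, c in zip(nn_euc, nn_cos)])
print(f"mean top-{k} overlap between Euclidean and cosine neighbours: {overlap:.2f}")
```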

@alexgarel
Member Author

@raphael0202 I might be wrong, but as I told you it was a supposition, and @tcayla seemed to confirm it with some experiments.

From my understanding, metric learning is another word for creating a space of similarities :-) and that is what the embedding is about here (we are matching logos by how alike they look in real life, that is, we learn a metric between the real objects represented by those images). An internet search on "metric learning and embedding" might also convince you that there is a strong link (but I have no synthetic reference on it, and I may be wrong; this is mostly my intuition).

For cosine similarity I also don't have a synthetic reference, and I may be wrong or inaccurate, although there does seem to be some advocacy for cosine similarity in high dimensions (see https://stats.stackexchange.com/questions/29627/euclidean-distance-is-usually-not-good-for-sparse-data-and-more-general-case).
