
We should use cosine similarity, not Euclidean distance, when computing nearest neighbours #148

Open
alexgarel opened this issue Sep 6, 2022 · 5 comments

Comments

@alexgarel
Member

alexgarel commented Sep 6, 2022

What

The embeddings we use generate a metric space based on cosine similarity.
When computing nearest neighbours, we currently use Euclidean distance.
Change this to use cosine similarity instead.
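
As a rough illustration of the requested change (a minimal sketch with scikit-learn and placeholder data, not the project's actual index code), switching the metric is typically a one-line change:

```python
# Sketch only: `embeddings` and `query` are placeholders, not our real logo features.
import numpy as np
from sklearn.neighbors import NearestNeighbors

embeddings = np.random.rand(1000, 512).astype(np.float32)  # placeholder logo embeddings
query = embeddings[:1]

# Current behaviour: Euclidean distance.
knn_euclidean = NearestNeighbors(n_neighbors=5, metric="euclidean").fit(embeddings)

# Proposed behaviour: cosine distance (1 - cosine similarity).
knn_cosine = NearestNeighbors(n_neighbors=5, metric="cosine").fit(embeddings)

print(knn_euclidean.kneighbors(query, return_distance=False))
print(knn_cosine.kneighbors(query, return_distance=False))
```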

@alexgarel
Member Author

Note that during Data For Good, @tcayla ran some tests on this and confirmed that the metric is indeed cosine similarity.

@raphael0202
Collaborator

I don't get what you mean by "the embeddings we use generate a metric space based on cosine similarity". Can you elaborate?

@alexgarel
Member Author

When you generate an embedding, you are generating a space with a learned metric (that is another way of looking at it).
So you should use the right metric when searching in that space.

It is well known that for word2vec the metric is cosine similarity and not Euclidean distance (and the two are quite different, especially if your vectors are not normalized, but even when they are normalized, and the gap widens with the number of dimensions).
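
A toy example (plain NumPy with made-up 2-D vectors, not word2vec itself) showing how the two metrics can disagree on which point is the nearest neighbour when vectors are not normalized:

```python
import numpy as np

query = np.array([1.0, 0.0])
a = np.array([10.0, 0.0])   # same direction as the query, but far away
b = np.array([0.9, 0.9])    # close in Euclidean terms, different direction

def euclidean(u, v):
    return np.linalg.norm(u - v)

def cosine_sim(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(euclidean(query, a), euclidean(query, b))    # b is closer: 9.0 vs ~0.91
print(cosine_sim(query, a), cosine_sim(query, b))  # a is more similar: 1.0 vs ~0.71
```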

I strongly suspected the logo embedding space to also be based on cosine similarity (it seems to me this is the case for most modern embedding algorithms that train a neural network and take the last hidden layer activations). @tcayla did some experiments and told me that it was indeed the case.

Note that cosine similarity has the drawback of not respecting the triangle inequality, so approximate nearest-neighbour methods may fail, but in practice this is not significant (to respect the triangle inequality one would have to normalize the vectors and switch to the angular, i.e. arc-cosine, distance, but that is really not worth it).
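
For what it's worth, the usual workaround for index structures that need a proper metric is to L2-normalize the vectors first: on unit vectors, squared Euclidean distance equals 2 × (1 − cosine similarity), so the Euclidean ranking matches the cosine ranking. A quick NumPy check of that identity (a sketch, not project code):

```python
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.normal(size=512), rng.normal(size=512)
u, v = u / np.linalg.norm(u), v / np.linalg.norm(v)  # L2-normalize

cos_sim = u @ v
sq_euclidean = np.sum((u - v) ** 2)

# ||u - v||^2 = 2 - 2 * cos(u, v) for unit vectors
assert np.isclose(sq_euclidean, 2 * (1 - cos_sim))
```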

@raphael0202
Collaborator

Do you have any references on this matter?
It's true that cosine similarity is the best metric for word2vec embeddings, but I had never heard of a learned metric associated with an embedding space in general. Apart from models that explicitly use a cosine similarity objective (such as some Siamese networks or CLIP, for example), I can't really see why cosine similarity would be more appropriate than Euclidean distance when the model is trained with a negative log-likelihood loss.
In my experience, cosine similarity and Euclidean distance give very similar results for KNN on embeddings obtained from classic computer vision models (ResNet, ...).
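
If someone wants to check this on our logo embeddings, one quick experiment (a sketch assuming a precomputed `embeddings` array; the random data below is only a placeholder) is to compare the top-k neighbour sets under both metrics and report their average overlap:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

embeddings = np.random.rand(500, 512)  # placeholder for real logo features
k = 10

idx_euc = NearestNeighbors(n_neighbors=k + 1, metric="euclidean").fit(embeddings)
idx_cos = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(embeddings)

nn_euc = idx_euc.kneighbors(embeddings, return_distance=False)[:, 1:]  # drop self
nn_cos = idx_cos.kneighbors(embeddings, return_distance=False)[:, 1:]

overlap = np.mean([len(set(e) & set(c)) / k for e, c in zip(nn_euc, nn_cos)])
print(f"mean top-{k} overlap between Euclidean and cosine neighbours: {overlap:.2f}")
```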

@alexgarel
Member Author

@raphael0202 I might be wrong, but as I told you it was a supposition, and @tcayla seemed to confirm it with some experiments.

From my understanding, metric learning is another word for creating a space of similarities :-) and that is what the embedding is about here (we are matching logos by how alike they look in real life, that is, we learn a metric between the real objects represented by those images). An internet search on "metric learning and embedding" might also convince you that there is a strong link (but I have no synthetic reference on it, and I may be wrong; this is mostly my intuition).

For cosine similarity I also don't have a synthetic reference, and I may be wrong or inaccurate, although there does seem to be some advocacy for cosine similarity in high dimensions (see https://stats.stackexchange.com/questions/29627/euclidean-distance-is-usually-not-good-for-sparse-data-and-more-general-case).
