We should use cosine similarity and not Euclidean distance when computing nearest neighbours #148
What
The embeddings we use generate a metric space based on cosine similarity.
When computing nearest neighbours, we currently use Euclidean distance.
Change this to use cosine similarity.
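As a concrete illustration of the requested change, here is a minimal sketch using scikit-learn's `NearestNeighbors` (the repo's actual index may differ; `embeddings` and `query` are placeholder names, not code from this project):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

embeddings = np.random.rand(1000, 128)  # placeholder logo embeddings
query = np.random.rand(1, 128)

# Current behaviour: Euclidean distance.
nn_euclidean = NearestNeighbors(n_neighbors=5, metric="euclidean").fit(embeddings)

# Proposed behaviour: cosine distance (1 - cosine similarity).
nn_cosine = NearestNeighbors(n_neighbors=5, metric="cosine").fit(embeddings)

dist_e, idx_e = nn_euclidean.kneighbors(query)
dist_c, idx_c = nn_cosine.kneighbors(query)
print(idx_e[0], idx_c[0])  # the two metrics can return different neighbours
```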
Comments
Note that during Data For Good, @tcayla ran some tests on this and confirmed it is cosine similarity.
I don't get what you mean by "The embeddings we use generate a metric space based on cosine similarity." Can you elaborate?
When you generate an embedding, you are generating a space with a learned metric (this is another way to see it). It is well known that in word2vec the metric is cosine similarity and not Euclidean distance (and the two are quite different, especially if your vectors are not normalized, but even if they are, and the gap widens with the number of dimensions). I strongly suspected the logo embedding space to also be based on cosine similarity (it seems to me this is the case for most modern embedding algorithms based on training a neural network and taking the last hidden layer's activations). @tcayla did some experiments and told me that it was indeed the case. Note that cosine similarity has the drawback of not respecting the triangle inequality, so approximate nearest neighbour methods may fail, but in practice this is not significant (to respect the triangle inequality one should normalize vectors and switch to the arccos (angular) distance, but that's really not worth it).
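For what it's worth, here is a small numpy check of the relationships mentioned above (illustrative only, not code from this repo):

```python
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.normal(size=64), rng.normal(size=64)

def cos_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Unnormalized vectors: a rescaled copy of u is cosine-identical to u
# but Euclidean-far from it, so the two metrics can rank neighbours
# very differently.
w = 10.0 * u
print(np.linalg.norm(u - w), 1 - cos_sim(u, w))  # large value vs ~0.0

# Unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b), so Euclidean distance
# becomes a monotone function of cosine similarity after normalization.
a, b = u / np.linalg.norm(u), v / np.linalg.norm(v)
print(np.linalg.norm(a - b) ** 2, 2 - 2 * cos_sim(a, b))  # equal

# Angular distance arccos(cos_sim) is a true metric (the triangle
# inequality holds), unlike cosine distance 1 - cos_sim.
angular = np.arccos(np.clip(cos_sim(a, b), -1.0, 1.0))
print(angular)
```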
Do you have any references on this matter?
@raphael0202 I might be wrong, but as I told you it was a supposition, and @tcayla seemed to confirm it from some experiments. From my understanding, metric learning is another word for creating a space of similarities :-) And that is what this embedding is about (we are matching logos by how alike they look in real life, that is, we learn a metric between the real objects represented by those images). Also, an internet search on "metric learning and embedding" might convince you that there is a strong link (but I have no single reference summarizing it, and I may be wrong; this is more my intuition about it). For cosine similarity I likewise have no definitive reference, and I may be wrong or inaccurate, although there does seem to be advocacy for cosine similarity in high dimensions (see https://stats.stackexchange.com/questions/29627/euclidean-distance-is-usually-not-good-for-sparse-data-and-more-general-case).
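As a quick, self-contained way to see how much the two metrics actually disagree on high-dimensional sparse data, here is a sketch (random data stands in for real logo embeddings; the sizes and density are arbitrary assumptions):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
# Sparse high-dimensional points, in the spirit of the linked discussion:
# roughly 5% of the 300 coordinates are nonzero.
X = rng.random((500, 300)) * (rng.random((500, 300)) < 0.05)

def top_k(metric, k=10):
    # k + 1 because the query point is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1, metric=metric).fit(X)
    _, idx = nn.kneighbors(X[:1])
    return set(idx[0][1:])

shared = top_k("euclidean") & top_k("cosine")
print(f"{len(shared)}/10 top-10 neighbours shared between the two metrics")
```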