Vector Embedding Distance #1042

matthewf-lyft started this conversation in General

I was wondering how I could use vector embeddings to calculate word distance between different data entries.

This guide is an easy starting place: https://huggingface.co/blog/getting-started-with-embeddings

Vector differences can be calculated using cosine similarity and bucketed into intervals such as [-1, 0.2), [0.2, 0.5), [0.5, 0.7), [0.7, 0.95), [0.95, 1), and the exact-match bucket [1, 1]. With different embeddings you may see that similar terms get bucketed more closely together; for example, "White House" might land in the same bucket as "The White House" (see the sketch after this post).

I could not find anyone referencing the use of embeddings or vectors in the discussions, so I was wondering if anyone had figured out how to use these in Splink?
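
To make the bucketing idea above concrete, here is a minimal numeric sketch (an illustration, not code from the thread): cosine similarity computed with numpy, then mapped into the intervals listed. The `EDGES` values and the `bucket` helper are assumptions for illustration; in practice the vectors would come from an embedding model such as those in the Hugging Face guide linked above.

```python
import numpy as np

# Bucket edges matching the intervals above:
# [-1, 0.2), [0.2, 0.5), [0.5, 0.7), [0.7, 0.95), [0.95, 1), plus the exact-match bucket [1, 1]
EDGES = [-1.0, 0.2, 0.5, 0.7, 0.95, 1.0]


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity of two embedding vectors; the result lies in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def bucket(sim: float) -> int:
    """Map a similarity score to a bucket index: 0-4 for the intervals, 5 for exactly 1."""
    if sim >= 1.0:
        return len(EDGES) - 1  # the degenerate [1, 1] bucket
    for i in range(len(EDGES) - 1):
        if EDGES[i] <= sim < EDGES[i + 1]:
            return i
    raise ValueError(f"similarity {sim} is outside [-1, 1]")


# Stand-ins for real embeddings (in practice these would come from a model).
a = np.array([0.1, 0.3, 0.5])
b = np.array([0.1, 0.29, 0.52])
print(bucket(cosine_similarity(a, b)))  # near-identical vectors -> bucket 4, i.e. [0.95, 1)
```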

Replies: 2 comments

- There's further discussion of this here:

- @matthewf-lyft there's now a first cut (very beta!) version of a jar that computes cosine distance for embeddings, and example code showing how to use it in a Splink model here (a rough sketch of the general pattern follows below).
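
Since the jar and its example code are only linked, not reproduced, in the thread, the following is just a hedged sketch of the general pattern for plugging a cosine-distance function into a Splink (v3-style) comparison. The `COSINE_DISTANCE` function name, the `embedding` column, and the distance thresholds are all illustrative assumptions, not the jar's actual interface.

```python
# Hedged sketch only: assumes a SQL UDF COSINE_DISTANCE(l, r) has been registered
# with the backend (e.g. via the jar mentioned above) and that both input datasets
# carry an array-valued "embedding" column. Names and thresholds are illustrative.
embedding_comparison = {
    "output_column_name": "embedding",
    "comparison_levels": [
        {
            "sql_condition": "embedding_l IS NULL OR embedding_r IS NULL",
            "label_for_charts": "Null",
            "is_null_level": True,
        },
        {
            "sql_condition": "COSINE_DISTANCE(embedding_l, embedding_r) < 0.05",
            "label_for_charts": "Embeddings very close",
        },
        {
            "sql_condition": "COSINE_DISTANCE(embedding_l, embedding_r) < 0.3",
            "label_for_charts": "Embeddings moderately close",
        },
        {"sql_condition": "ELSE", "label_for_charts": "All other comparisons"},
    ],
}
```

The thresholds play the same role as the similarity buckets in the original question, just expressed as distance cut-offs; the dict would then go in the `comparisons` list of a Splink settings dictionary alongside comparisons on other columns.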