Added dot product everywhere were cosine similarity was used #676
Open
joancf wants to merge 11 commits into alexklibisz:main from joancf:265-dot_product
Commits
39fa610 Added dot product everywhere were cosine similarity was used (joancf)
a706edf Found some bugs when trying to build/run tests (joancf)
10e3487 Update docs/_posts/2021-07-30-how-does-elastiknn-work.md (joancf)
39c66e3 Update docs/_posts/2021-07-30-how-does-elastiknn-work.md (joancf)
61404e0 Update docs/pages/index.md (joancf)
0cf5f58 Update docs/pages/api.md (joancf)
60fde2a Update docs/pages/api.md (joancf)
3f835eb Update elastiknn-models/src/main/java/com/klibisz/elastiknn/models/Do… (joancf)
59dff2c Addd changes to footnote (joancf)
1028572 dotSimilarity does not return negative floats (joancf)
281e6c4 zero as min value (joancf)
@@ -292,6 +292,30 @@ PUT /my-index/_mapping
   }
 }
 ```
+### Dot LSH Mapping
+
+Uses the [Random Projection algorithm](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection)
+to hash and store dense float vectors such that they support approximate Dot similarity queries. Equivalent to Cosine similarity if the vectors are normalized.
+
+The implementation is influenced by Chapter 3 of [Mining Massive Datasets.](http://www.mmds.org/)
+
+```json
+PUT /my-index/_mapping
+{
+  "properties": {
+    "my_vec": {
+      "type": "elastiknn_dense_float_vector", # 1
+      "elastiknn": {
+        "dims": 100, # 2
+        "model": "lsh", # 3
+        "similarity": "dot", # 4
+        "L": 99, # 5
+        "k": 1 # 6
+      }
+    }
+  }
+}
+```
 
 |#|Description|
 |:--|:--|
@@ -425,7 +449,7 @@ GET /my-index/_search
 ### Compatibility of Vector Types and Similarities
 
 Jaccard and Hamming similarity only work with sparse bool vectors.
-Cosine,[^note-angular-cosine] L1, and L2 similarity only work with dense float vectors.
+Cosine,[^note-angular-cosine] Dot,[^note-dot-product] L1, and L2 similarity only work with dense float vectors.
 The following documentation assume this restriction is known.
 
 These restrictions aren't inherent to the types and algorithms, i.e., you could in theory run cosine similarity on sparse vectors.
@@ -446,9 +470,12 @@ The exact transformations are described below.
 |Jaccard|N/A|0|1.0|
 |Hamming|N/A|0|1.0|
 |Cosine[^note-angular-cosine]|`cosine similarity + 1`|0|2|
+|Dot[^note-dot-product]|`Dot similarity + 1`|0|2|
 |L1|`1 / (1 + l1 distance)`|0|1|
 |L2|`1 / (1 + l2 distance)`|0|1|
 
+Dot similarity will produce negative scores if the vectors are not normalized.
Review comment: We should make sure to catch this and return an error in the plugin.
Reply: I added max(0, distance), so it will always be positive.
+
 If you're using the `elastiknn_nearest_neighbors` query with other queries, and the score values are inconvenient (e.g. huge values like 1e6), consider wrapping the query in a [Script Score Query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-script-score-query.html), where you can access and transform the `_score` value.
 
 ### Query Vector
@@ -621,6 +648,36 @@ GET /my-index/_search
 |5|Number of candidates per segment. See the section on LSH Search Strategy.|
 |6|Set to true to use the more-like-this heuristic to pick a subset of hashes. Generally faster but still experimental.|
 
+### Dot LSH Query
+
+Retrieve dense float vectors based on approximate Dot similarity.[^note-dot-product]
+
+```json
+GET /my-index/_search
+{
+  "query": {
+    "elastiknn_nearest_neighbors": {
+      "field": "my_vec", # 1
+      "vec": { # 2
+        "values": [0.1, 0.2, 0.3, ...]
+      },
+      "model": "lsh", # 3
+      "similarity": "dot", # 4
+      "candidates": 50 # 5
+    }
+  }
+}
+```
+
+|#|Description|
+|:--|:--|
+|1|Indexed field. Must use `lsh` mapping model with `dot`[^note-dot-product] similarity.|
+|2|Query vector. Must be literal dense float or a pointer to an indexed dense float vector.|
+|3|Model name.|
+|4|Similarity function.|
+|5|Number of candidates per segment. See the section on LSH Search Strategy.|
+|6|Set to true to use the more-like-this heuristic to pick a subset of hashes. Generally faster but still experimental.|
+
 ### L1 LSH Query
 
 Not yet implemented.
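For the Dot LSH Query added in the hunk above, the request body can also be assembled programmatically. The following is a minimal sketch, assuming the `"similarity": "dot"` option proposed in this pull request is accepted by the plugin; `build_dot_lsh_query` is a hypothetical helper name, not part of any Elastiknn client.

```python
import json


def build_dot_lsh_query(field, values, candidates=50):
    """Hypothetical helper: build the body of a Dot LSH nearest-neighbors query.

    The keys mirror the JSON example in the proposed docs; "dot" is the
    similarity name introduced by this pull request.
    """
    return {
        "query": {
            "elastiknn_nearest_neighbors": {
                "field": field,                   # indexed vector field
                "vec": {"values": list(values)},  # literal dense float query vector
                "model": "lsh",
                "similarity": "dot",
                "candidates": candidates,         # candidates per segment
            }
        }
    }


body = build_dot_lsh_query("my_vec", [0.1, 0.2, 0.3])
print(json.dumps(body, indent=2))
```

The resulting dictionary can be sent as the body of `GET /my-index/_search` with whichever HTTP or Elasticsearch client is already in use.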
@@ -707,12 +764,13 @@ The similarity functions are abbreviated (J: Jaccard, H: Hamming, C: Cosine,[^no
 
 #### elastiknn_dense_float_vector
 
-|Model / Query                   |Exact         |Cosine LSH |L2 LSH |Permutation LSH|
-|:--                             |:--           |:--        |:--    |:--            |
-|Exact (i.e. no model specified) |✔ (C, L1, L2) |x          |x      |x              |
-|Cosine LSH                      |✔ (C, L1, L2) |✔          |x      |x              |
-|L2 LSH                          |✔ (C, L1, L2) |x          |✔      |x              |
-|Permutation LSH                 |✔ (C, L1, L2) |x          |x      |✔              |
+|Model / Query                   |Exact            |Cosine LSH |Dot LSH |L2 LSH |Permutation LSH|
+|:--                             |:--              |:--        |:--     |:--    |:--            |
+|Exact (i.e. no model specified) |✔ (C, D, L1, L2) |x          |x       |x      |x              |
+|Cosine LSH                      |✔ (C, D, L1, L2) |✔          |✔       |x      |x              |
+|Dot LSH                         |✔ (C, D, L1, L2) |✔          |✔       |x      |x              |
+|L2 LSH                          |✔ (C, D, L1, L2) |x          |x       |✔      |x              |
+|Permutation LSH                 |✔ (C, D, L1, L2) |x          |x       |x      |✔              |
 
 ### Running Nearest Neighbors Query on a Filtered Subset of Documents
 
@@ -860,4 +918,5 @@ PUT /my-index
 
 See the [create index documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-create-index.html) for more details.
 
-[^note-angular-cosine]: Cosine similarity used to be (incorrectly) called "angular" similarity. All references to "angular" were renamed to "Cosine" in 7.13.3.2. You can still use "angular" in the JSON/HTTP API; it will convert to "cosine" internally.
+[^note-angular-cosine]: Cosine similarity used to be (incorrectly) called "angular" similarity. All references to "angular" were renamed to "Cosine" in 7.13.3.2. You can still use "angular" in the JSON/HTTP API; it will convert to "cosine" internally.
+[^note-dot-product]: Dot product is intended to be used with normalized vectors v, meaning that ||v|| == 1.
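The `note-dot-product` footnote assumes unit-length vectors. As a rough illustration of why that matters, the sketch below normalizes two vectors and checks that their dot product then matches the cosine similarity of the originals. The helper names (`dot`, `norm`, `normalize`, `cosine`) are chosen for this example only and are not taken from the plugin.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    """Euclidean (L2) norm."""
    return math.sqrt(dot(v, v))


def normalize(v):
    """Scale v to unit length so that ||v|| == 1."""
    n = norm(v)
    if n == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / n for x in v]


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(u, v) / (norm(u) * norm(v))


u, v = [3.0, 4.0], [1.0, 2.0]
u_hat, v_hat = normalize(u), normalize(v)

# After normalization the dot product equals the cosine similarity.
print(math.isclose(dot(u_hat, v_hat), cosine(u, v)))
# Without normalization the two values differ.
print(math.isclose(dot(u, v), cosine(u, v)))
```

Normalizing once at indexing time and once per query vector is one way to satisfy the footnote's assumption on the client side.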
I think there might be something wrong with the table formatting:
Also, if we end up using it, we should describe the updated transformation here:
`max(0, 1 + dot product)`
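The transformation suggested in this comment can be sketched as follows. This only illustrates the formula `max(0, 1 + dot product)`; it is not the plugin's code, and `dot_score` is a hypothetical name.

```python
def dot_score(u, v):
    """Score sketch: max(0, 1 + dot product), so the result is never negative."""
    d = sum(a * b for a, b in zip(u, v))
    return max(0.0, 1.0 + d)


# Unit vectors: the score stays within [0, 2].
same_direction = dot_score([1.0, 0.0], [1.0, 0.0])
opposite_direction = dot_score([1.0, 0.0], [-1.0, 0.0])

# Non-normalized vectors: the clamp prevents a negative score,
# but the upper bound of 2 no longer holds.
clamped = dot_score([2.0, 0.0], [-3.0, 0.0])
unbounded = dot_score([2.0, 0.0], [3.0, 0.0])

print(same_direction, opposite_direction, clamped, unbounded)
```

With non-normalized input, the clamp maps every dot product at or below -1 to a score of 0, so such vectors can no longer be ranked against each other. That may be one reason to prefer rejecting non-normalized vectors with an error, as suggested in the review thread.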