-
Notifications
You must be signed in to change notification settings - Fork 135
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add patch to support multi vector in faiss #1358
Add patch to support multi vector in faiss #1358
Conversation
81b29e2
to
610c0dc
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## feature/multi-vector #1358 +/- ##
==========================================================
- Coverage 85.15% 85.04% -0.12%
- Complexity 1216 1248 +32
==========================================================
Files 160 162 +2
Lines 4958 5101 +143
Branches 457 477 +20
==========================================================
+ Hits 4222 4338 +116
- Misses 538 557 +19
- Partials 198 206 +8 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Im wondering if for patches we can raise a PR against our personal forks of faiss and link them in the PR so we can review the code changes as opposed to review via git diff. Not sure if there is a better approach. What do you think @heemin32 ? |
@@ -38,6 +38,16 @@ jobs: | |||
with: | |||
submodules: true | |||
|
|||
# Git functionality in CMAKE file does not work with given ubuntu image. Therefore, handling it here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is meant by this? If we do this, everytime we add a patch we will need to update this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. Git command in CMAKE file like git submodule update --init --
or git apply
does not work for ubuntu. Need to talk with @peterzhuamazon to see if we can fix it by using different ubuntu image.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@heemin32 what was the conclusion for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code is needed as of now to make it work. Will talk to @peterzhuamazon and create an issue in opensearch build repo to resolve the problem outside of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is that discussion blocking this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It won't.
cd jni/external/faiss | ||
git config --global core.autocrlf input | ||
git apply --ignore-space-change --ignore-whitespace --3way ../../patches/faiss/0001-Custom-patch-to-support-multi-vector.patch | ||
rm ../../patches/faiss/0001-Custom-patch-to-support-multi-vector.patch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why remove?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commented in line #42. # Deleting file at the end to skip
git apply inside CMAKE file
CHANGELOG.md
Outdated
@@ -28,3 +28,4 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), | |||
* Upgrade urllib to 1.26.18 [#1319](https://github.com/opensearch-project/k-NN/pull/1319) | |||
* Upgrade guava to 32.1.3 [#1319](https://github.com/opensearch-project/k-NN/pull/1319) | |||
### Refactoring | |||
* Add patch to support multi vector in faiss [#1358](https://github.com/opensearch-project/k-NN/pull/1358) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a refactor or a feature?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put it as refactoring because this PR itself does not provide any new feature to users. However, I added CHANGELOG because it might have a impact in performance of current feature and we want to catch the root cause easily if that happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@heemin32 what I would say lets create a feature branch for this change and add all the changes in that feature branch. Then apply a single commit in main branch as part of 1 PR. This will ensure that this feature is packed in 1 PR. This is similar to what I did for Faiss Filtering also. It can avoid comments like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to feature branch.
@@ -79,7 +86,7 @@ list(APPEND TARGET_LIBS ${TARGET_LIB_COMMON}) | |||
# ---------------------------------- NMSLIB ---------------------------------- | |||
if (${CONFIG_NMSLIB} STREQUAL ON OR ${CONFIG_ALL} STREQUAL ON OR ${CONFIG_TEST} STREQUAL ON) | |||
# Check if nmslib exists | |||
find_path(NMS_REPO_DIR NAMES similarity_search PATHS ${CMAKE_CURRENT_SOURCE_DIR}/external/nmslib) | |||
find_path(NMS_REPO_DIR NAMES similarity_search PATHS ${CMAKE_CURRENT_SOURCE_DIR}/external/nmslib NO_DEFAULT_PATH) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In mac, I have faiss in /opt/homebrew/share
and the command fail to pull faiss submodule. I don't have nmslib but thought that there is no harm on doing same for nmslib.
# If it exists, apply patches | ||
if (EXISTS ${PATCH_FILE}) | ||
message(STATUS "Applying custom patches.") | ||
execute_process(COMMAND git apply --ignore-space-change --ignore-whitespace --3way ${CMAKE_CURRENT_SOURCE_DIR}/patches/faiss/0001-Custom-patch-to-support-multi-vector.patch WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}/external/faiss ERROR_VARIABLE ERROR_MSG RESULT_VARIABLE RESULT_CODE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we loop over patches?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There won't be many patches I hope. Also, explicit applying would be better to make sure that we are applying all intended patches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 we should apply patches explicitly.
@@ -0,0 +1,221 @@ | |||
From 864c1abe7bdced5d306e871ea2bd73e1e35987fd Mon Sep 17 00:00:00 2001 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a README about patches? Or update DEV guide? Want to make sure we know what we are doing on lib upgrades.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me add in DEV guide on how to generate patches or how to update patches.
Good Idea. Added in PR description. |
de7e818
to
5e8c3eb
Compare
Test for patch completed successfully. facebookresearch/faiss#3178 |
Overall code looks good to me. Please resolve all the comments before I can approve the PR |
Signed-off-by: Heemin Kim <[email protected]>
8631654
into
opensearch-project:feature/multi-vector
Signed-off-by: Heemin Kim <[email protected]>
Signed-off-by: Heemin Kim <[email protected]>
Signed-off-by: Heemin Kim <[email protected]>
* Add patch to support multi vector in faiss (#1358) Signed-off-by: Heemin Kim <[email protected]> * Initialize id_map as null (#1363) Signed-off-by: Heemin Kim <[email protected]> * Add support of multi vector in jni (#1364) Signed-off-by: Heemin Kim <[email protected]> * Multi vector support for Faiss HNSW (#1371) Apply the parentId filter to the Faiss HNSW search method. This ensures that documents are deduplicated based on their parentId, and the method returns k results for documents with nested fields. Signed-off-by: Heemin Kim <[email protected]> * Add data generation script for nested field (#1388) Signed-off-by: Heemin Kim <[email protected]> * Add perf test for nested field (#1394) Signed-off-by: Heemin Kim <[email protected]> --------- Signed-off-by: Heemin Kim <[email protected]>
Signed-off-by: Heemin Kim <[email protected]>
* Add patch to support multi vector in faiss (opensearch-project#1358) Signed-off-by: Heemin Kim <[email protected]> * Initialize id_map as null (opensearch-project#1363) Signed-off-by: Heemin Kim <[email protected]> * Add support of multi vector in jni (opensearch-project#1364) Signed-off-by: Heemin Kim <[email protected]> * Multi vector support for Faiss HNSW (opensearch-project#1371) Apply the parentId filter to the Faiss HNSW search method. This ensures that documents are deduplicated based on their parentId, and the method returns k results for documents with nested fields. Signed-off-by: Heemin Kim <[email protected]> * Add data generation script for nested field (opensearch-project#1388) Signed-off-by: Heemin Kim <[email protected]> * Add perf test for nested field (opensearch-project#1394) Signed-off-by: Heemin Kim <[email protected]> --------- Signed-off-by: Heemin Kim <[email protected]> (cherry picked from commit 709b448)
* Add patch to support multi vector in faiss (opensearch-project#1358) Signed-off-by: Heemin Kim <[email protected]> * Initialize id_map as null (opensearch-project#1363) Signed-off-by: Heemin Kim <[email protected]> * Add support of multi vector in jni (opensearch-project#1364) Signed-off-by: Heemin Kim <[email protected]> * Multi vector support for Faiss HNSW (opensearch-project#1371) Apply the parentId filter to the Faiss HNSW search method. This ensures that documents are deduplicated based on their parentId, and the method returns k results for documents with nested fields. Signed-off-by: Heemin Kim <[email protected]> * Add data generation script for nested field (opensearch-project#1388) Signed-off-by: Heemin Kim <[email protected]> * Add perf test for nested field (opensearch-project#1394) Signed-off-by: Heemin Kim <[email protected]> --------- Signed-off-by: Heemin Kim <[email protected]> (cherry picked from commit 709b448)
* Add patch to support multi vector in faiss (#1358) Signed-off-by: Heemin Kim <[email protected]> * Initialize id_map as null (#1363) Signed-off-by: Heemin Kim <[email protected]> * Add support of multi vector in jni (#1364) Signed-off-by: Heemin Kim <[email protected]> * Multi vector support for Faiss HNSW (#1371) Apply the parentId filter to the Faiss HNSW search method. This ensures that documents are deduplicated based on their parentId, and the method returns k results for documents with nested fields. Signed-off-by: Heemin Kim <[email protected]> * Add data generation script for nested field (#1388) Signed-off-by: Heemin Kim <[email protected]> * Add perf test for nested field (#1394) Signed-off-by: Heemin Kim <[email protected]> --------- Signed-off-by: Heemin Kim <[email protected]> (cherry picked from commit 709b448)
Description
Add patch to support multi vector in faiss
Introduced an extension point in the way of collecting search result in HNSW algorithm.
ResultCollectorFactory
is passed as part of a search parameter. For each search inside faiss library, it create a newResultCollector
using the factory. TheResultCollector
'scollect
method is used to collect search data.This enable us to pass custom ResultCollector with which we can dedupe vector on its parent Id so that exact k result of original document can be returned for nested field.
Issues Resolved
#1065
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.