Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add patch to support multi vector in faiss #1358

Merged
merged 1 commit into from
Dec 27, 2023

Conversation

heemin32
Copy link
Collaborator

@heemin32 heemin32 commented Dec 20, 2023

Description

Add patch to support multi vector in faiss

Introduced an extension point in the way of collecting search result in HNSW algorithm. ResultCollectorFactory is passed as part of a search parameter. For each search inside faiss library, it create a new ResultCollector using the factory. The ResultCollector's collect method is used to collect search data.

This enable us to pass custom ResultCollector with which we can dedupe vector on its parent Id so that exact k result of original document can be returned for nested field.

Issues Resolved

#1065

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Copy link

codecov bot commented Dec 21, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (2e3ab95) 85.15% compared to head (f9fe53b) 85.04%.
Report is 3 commits behind head on feature/multi-vector.

❗ Current head f9fe53b differs from pull request most recent head 4c936f9. Consider uploading reports for the commit 4c936f9 to get more accurate results

Additional details and impacted files
@@                    Coverage Diff                     @@
##             feature/multi-vector    #1358      +/-   ##
==========================================================
- Coverage                   85.15%   85.04%   -0.12%     
- Complexity                   1216     1248      +32     
==========================================================
  Files                         160      162       +2     
  Lines                        4958     5101     +143     
  Branches                      457      477      +20     
==========================================================
+ Hits                         4222     4338     +116     
- Misses                        538      557      +19     
- Partials                      198      206       +8     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@heemin32 heemin32 added backport 2.x v2.12.0 Refactoring Improve the design, structure, and implementation while preserving its functionality labels Dec 21, 2023
ryanbogan
ryanbogan previously approved these changes Dec 21, 2023
Copy link
Member

@ryanbogan ryanbogan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@jmazanec15
Copy link
Member

Im wondering if for patches we can raise a PR against our personal forks of faiss and link them in the PR so we can review the code changes as opposed to review via git diff. Not sure if there is a better approach. What do you think @heemin32 ?

@@ -38,6 +38,16 @@ jobs:
with:
submodules: true

# Git functionality in CMAKE file does not work with given ubuntu image. Therefore, handling it here.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is meant by this? If we do this, everytime we add a patch we will need to update this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. Git command in CMAKE file like git submodule update --init -- or git apply does not work for ubuntu. Need to talk with @peterzhuamazon to see if we can fix it by using different ubuntu image.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@heemin32 what was the conclusion for this?

Copy link
Collaborator Author

@heemin32 heemin32 Dec 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code is needed as of now to make it work. Will talk to @peterzhuamazon and create an issue in opensearch build repo to resolve the problem outside of this PR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is that discussion blocking this PR?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It won't.

cd jni/external/faiss
git config --global core.autocrlf input
git apply --ignore-space-change --ignore-whitespace --3way ../../patches/faiss/0001-Custom-patch-to-support-multi-vector.patch
rm ../../patches/faiss/0001-Custom-patch-to-support-multi-vector.patch
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented in line #42. # Deleting file at the end to skip git apply inside CMAKE file

.github/workflows/CI.yml Outdated Show resolved Hide resolved
CHANGELOG.md Outdated
@@ -28,3 +28,4 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
* Upgrade urllib to 1.26.18 [#1319](https://github.com/opensearch-project/k-NN/pull/1319)
* Upgrade guava to 32.1.3 [#1319](https://github.com/opensearch-project/k-NN/pull/1319)
### Refactoring
* Add patch to support multi vector in faiss [#1358](https://github.com/opensearch-project/k-NN/pull/1358)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a refactor or a feature?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put it as refactoring because this PR itself does not provide any new feature to users. However, I added CHANGELOG because it might have a impact in performance of current feature and we want to catch the root cause easily if that happen.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@heemin32 what I would say lets create a feature branch for this change and add all the changes in that feature branch. Then apply a single commit in main branch as part of 1 PR. This will ensure that this feature is packed in 1 PR. This is similar to what I did for Faiss Filtering also. It can avoid comments like this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to feature branch.

@@ -79,7 +86,7 @@ list(APPEND TARGET_LIBS ${TARGET_LIB_COMMON})
# ---------------------------------- NMSLIB ----------------------------------
if (${CONFIG_NMSLIB} STREQUAL ON OR ${CONFIG_ALL} STREQUAL ON OR ${CONFIG_TEST} STREQUAL ON)
# Check if nmslib exists
find_path(NMS_REPO_DIR NAMES similarity_search PATHS ${CMAKE_CURRENT_SOURCE_DIR}/external/nmslib)
find_path(NMS_REPO_DIR NAMES similarity_search PATHS ${CMAKE_CURRENT_SOURCE_DIR}/external/nmslib NO_DEFAULT_PATH)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In mac, I have faiss in /opt/homebrew/share and the command fail to pull faiss submodule. I don't have nmslib but thought that there is no harm on doing same for nmslib.

# If it exists, apply patches
if (EXISTS ${PATCH_FILE})
message(STATUS "Applying custom patches.")
execute_process(COMMAND git apply --ignore-space-change --ignore-whitespace --3way ${CMAKE_CURRENT_SOURCE_DIR}/patches/faiss/0001-Custom-patch-to-support-multi-vector.patch WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}/external/faiss ERROR_VARIABLE ERROR_MSG RESULT_VARIABLE RESULT_CODE)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we loop over patches?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There won't be many patches I hope. Also, explicit applying would be better to make sure that we are applying all intended patches.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 we should apply patches explicitly.

@@ -0,0 +1,221 @@
From 864c1abe7bdced5d306e871ea2bd73e1e35987fd Mon Sep 17 00:00:00 2001
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a README about patches? Or update DEV guide? Want to make sure we know what we are doing on lib upgrades.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me add in DEV guide on how to generate patches or how to update patches.

@heemin32
Copy link
Collaborator Author

heemin32 commented Dec 21, 2023

Im wondering if for patches we can raise a PR against our personal forks of faiss and link them in the PR so we can review the code changes as opposed to review via git diff. Not sure if there is a better approach. What do you think @heemin32 ?

Good Idea. Added in PR description.
Adding here as well for your convenience. https://github.com/heemin32/faiss/pull/1/files

@heemin32 heemin32 changed the base branch from main to feature/multi-vector December 22, 2023 06:26
@heemin32 heemin32 removed Refactoring Improve the design, structure, and implementation while preserving its functionality backport 2.x labels Dec 22, 2023
@heemin32 heemin32 requested a review from jmazanec15 December 22, 2023 18:20
@heemin32 heemin32 force-pushed the patch branch 2 times, most recently from de7e818 to 5e8c3eb Compare December 22, 2023 22:18
@heemin32
Copy link
Collaborator Author

Test for patch completed successfully. facebookresearch/faiss#3178

@navneet1v
Copy link
Collaborator

Overall code looks good to me. Please resolve all the comments before I can approve the PR

@heemin32 heemin32 merged commit 8631654 into opensearch-project:feature/multi-vector Dec 27, 2023
26 of 45 checks passed
heemin32 added a commit to heemin32/k-NN that referenced this pull request Jan 5, 2024
heemin32 added a commit to heemin32/k-NN that referenced this pull request Jan 16, 2024
heemin32 added a commit that referenced this pull request Jan 16, 2024
heemin32 added a commit that referenced this pull request Jan 19, 2024
* Add patch to support multi vector in faiss (#1358)

Signed-off-by: Heemin Kim <[email protected]>

* Initialize id_map as null (#1363)

Signed-off-by: Heemin Kim <[email protected]>

* Add support of multi vector in jni (#1364)

Signed-off-by: Heemin Kim <[email protected]>

* Multi vector support for Faiss HNSW (#1371)

Apply the parentId filter to the Faiss HNSW search method. This ensures that documents are deduplicated based on their parentId, and the method returns k results for documents with nested fields.

Signed-off-by: Heemin Kim <[email protected]>

* Add data generation script for nested field (#1388)

Signed-off-by: Heemin Kim <[email protected]>

* Add perf test for nested field (#1394)

Signed-off-by: Heemin Kim <[email protected]>

---------

Signed-off-by: Heemin Kim <[email protected]>
heemin32 added a commit that referenced this pull request Jan 19, 2024
heemin32 added a commit to heemin32/k-NN that referenced this pull request Jan 19, 2024
* Add patch to support multi vector in faiss (opensearch-project#1358)

Signed-off-by: Heemin Kim <[email protected]>

* Initialize id_map as null (opensearch-project#1363)

Signed-off-by: Heemin Kim <[email protected]>

* Add support of multi vector in jni (opensearch-project#1364)

Signed-off-by: Heemin Kim <[email protected]>

* Multi vector support for Faiss HNSW (opensearch-project#1371)

Apply the parentId filter to the Faiss HNSW search method. This ensures that documents are deduplicated based on their parentId, and the method returns k results for documents with nested fields.

Signed-off-by: Heemin Kim <[email protected]>

* Add data generation script for nested field (opensearch-project#1388)

Signed-off-by: Heemin Kim <[email protected]>

* Add perf test for nested field (opensearch-project#1394)

Signed-off-by: Heemin Kim <[email protected]>

---------

Signed-off-by: Heemin Kim <[email protected]>
(cherry picked from commit 709b448)
heemin32 added a commit to heemin32/k-NN that referenced this pull request Jan 19, 2024
* Add patch to support multi vector in faiss (opensearch-project#1358)

Signed-off-by: Heemin Kim <[email protected]>

* Initialize id_map as null (opensearch-project#1363)

Signed-off-by: Heemin Kim <[email protected]>

* Add support of multi vector in jni (opensearch-project#1364)

Signed-off-by: Heemin Kim <[email protected]>

* Multi vector support for Faiss HNSW (opensearch-project#1371)

Apply the parentId filter to the Faiss HNSW search method. This ensures that documents are deduplicated based on their parentId, and the method returns k results for documents with nested fields.

Signed-off-by: Heemin Kim <[email protected]>

* Add data generation script for nested field (opensearch-project#1388)

Signed-off-by: Heemin Kim <[email protected]>

* Add perf test for nested field (opensearch-project#1394)

Signed-off-by: Heemin Kim <[email protected]>

---------

Signed-off-by: Heemin Kim <[email protected]>
(cherry picked from commit 709b448)
heemin32 added a commit that referenced this pull request Jan 19, 2024
* Add patch to support multi vector in faiss (#1358)

Signed-off-by: Heemin Kim <[email protected]>

* Initialize id_map as null (#1363)

Signed-off-by: Heemin Kim <[email protected]>

* Add support of multi vector in jni (#1364)

Signed-off-by: Heemin Kim <[email protected]>

* Multi vector support for Faiss HNSW (#1371)

Apply the parentId filter to the Faiss HNSW search method. This ensures that documents are deduplicated based on their parentId, and the method returns k results for documents with nested fields.

Signed-off-by: Heemin Kim <[email protected]>

* Add data generation script for nested field (#1388)

Signed-off-by: Heemin Kim <[email protected]>

* Add perf test for nested field (#1394)

Signed-off-by: Heemin Kim <[email protected]>

---------

Signed-off-by: Heemin Kim <[email protected]>
(cherry picked from commit 709b448)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants