-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Backport Faiss-based vector format to 10.x #14843
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
# Conflicts: # lucene/CHANGES.txt
Also included #14847 in this PR |
Did you try running the new monster Faiss test too @kaivalnp? |
@mikemccand it succeeded a few times, but I ran into a failure after multiple runs: Command (I'm on JDK21): gradlew :lucene:sandbox:test --tests "org.apache.lucene.sandbox.codecs.faiss.TestFaissKnnVectorsFormat.testLargeVectorData" -Ptests.heapsize=16g -Ptests.jvms=12 "-Ptests.jvmargs=-XX:TieredStopAtLevel=1 -XX:+UseParallelGC -XX:ActiveProcessorCount=1" -Ptests.seed=6FE1439CE22DB5DA -Ptests.useSecurityManager=true -Ptests.monster=true -Ptests.gui=false -Ptests.file.encoding=UTF-8 -Ptests.vectorsize=256 -Ptests.forceintegervectors=true -Ptests.faiss.run=true Error:
The codec footer check seems to fail on the KnnVectorsFormat format =
new Lucene99FlatVectorsFormat(FlatVectorScorerUtil.getLucene99FlatVectorsScorer()); and the test still fails, so it seems unrelated / pre-existing? FYI I see the same failure on |
Egads, OK, this does look like a pre-existing issue. Can you open a spinoff issue so we can get to the bottom of that? If I'll try to review this PR soon! Thank you for working out the backport. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great @kaivalnp. It looks like a rote backport, except for the sandbox mrjar addition. Since main
is happy with the Faiss codec, and it's baked ~3 weeks now, I'll merge this in a day or so. (It is Prime Day over here, kinda busy for Amazon Search team heh).
I am not fully happy with this, so I would have given a -1 for the merge. Now it is too late! I am very busy at moment so I wasn't aber to review the code earlier! IMHO, we should have waited and better release main branch with this as Lucene 11. I'd suggest (see my talk @ buzzwords) to release Lucene 11 a few months after the release of Java 25 (e.g., November 2025). We have good features like this already. There are also some code issues (like catch Throwable and not rethrowing RuntimeException|Error unmodified and instead all is wrapped in unspecifid RuntimeException.... - this is a no-go in Lucene code and also discouraged by security researchers). Please add followup PRs and cleanup the exception handling. If you catch "Throwable" alsways add a In addition, all code triggering the restricted warning should be in a single class (@rmuir already complained with TODOs). So my personal opinion: Revert revert revert. |
configure(project(":lucene:core")) { | ||
configure([ | ||
project(":lucene:core"), | ||
project(":lucene:sandbox"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unfortunately this causes also the vector API be replicated in the sandbox module.... I don't like that. It also adds useless JAR files again to the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, I'll try to reuse the apijar
from lucene.core
* | ||
* @lucene.experimental | ||
*/ | ||
public class FaissKnnVectorsFormatProvider extends KnnVectorsFormat { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this class should be named FaissKnnVectorsFormat
. It is not a provider pattern. In MMapDir this is different, there it is a provider for IndexInputs.
Instead the FaissKnnVectorsFormat in the java21 part should be package private (it is now public) and have a different name, e.g. 'FaissKnnVectorsFormatImpl'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, I've addressed this in the refactor on main
|
||
private static KnnVectorsFormat lookup(Object... args) { | ||
try { | ||
MethodHandles.Lookup lookup = MethodHandles.lookup(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have a private lookup here, so we can lookup a package private impl.
* | ||
* @lucene.experimental | ||
*/ | ||
public final class FaissKnnVectorsFormat extends KnnVectorsFormat { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make package-private
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+rename (see above)
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormatProvider |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename (see above)
Oh, sorry, thanks @uschindler -- I will simply revert and we can iterate some more! This is good feedback, and also applies to the But this is all in sandbox :) It's OK to be a little ... sandy! I will revert shortly. |
OK I reverted -- let's iterate some more -- @kaivalnp can you open a followon PR maybe? Thank you! Maybe start with @uschindler's feedback on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some more comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think we should not duplicate that file.
Either refer to it from the other module (seems most non-invasive) and especially compatible with main branch or move the code that accesses foreign APIs to main branch. In ideal world only the LibFaissC should access native APIs and add abstractions for all code to get rid of MemorySegment. Then we could place a provider to get the LibFaissC fclass with tha API abstractions from the core's org.apache.lucene.internal package (which is only exported to sandbox but not to public) and the KNN vector codec would reside in the sandbox package and only call methods which use public Java classes.
I would prefer also on main branch to put all restricted code in LibFaissC only.
This would also solve @rmuir's complaints about duoing "restricted" chnages to MemorySegments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only the LibFaissC should access native APIs and add abstractions for all code to get rid of MemorySegment
also solve @rmuir's complaints about duoing "restricted" chnages to MemorySegments
Makes sense, I've addressed these in the refactor!
try { | ||
return (T) handle.invokeWithArguments(args); | ||
} catch (Throwable e) { | ||
throw new RuntimeException(e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The MethodHandle is to native code so in general it can only throw errors regarding type addaption. So please change this to:
} catch (Error|RuntimeException e) {
throw e;
} catch (Throwable e) {
throw new AssertionError("Should not throw checked exceptions", e);
}
Please keep in mind that this is the slowest way to call a methodhandle. If those "calls" are done in loops affectimg performance, you should really really use invokeExact with exact types. For that you have to repeat the error handling code over and over, but it is better than the current code.
Have you looked at jextract
which can generate a stub class from a header file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
jextract
which can generate a stub class from a header file?
Yes, this would have been ideal! https://github.com/openjdk/jextract
But the C API of Faiss makes use of forward declarations which aren't supported:
__stddef_max_align_t.h:22:15: warning: Skipping __clang_max_align_nonce2 (type LongDouble is not supported)
Index_c.h:21:1: warning: Skipping FaissRangeSearchResult_H (type Declared(FaissRangeSearchResult_H) is not supported)
Index_c.h:21:1: warning: Skipping FaissRangeSearchResult (type Declared(FaissRangeSearchResult_H) is not supported)
(output when I tried to generate classes for some C headers)
allocateArrayFunctionName, | ||
MethodType.methodType(MemorySegment.class, MemoryLayout.class, long.class)); | ||
} catch (IllegalAccessException | NoSuchMethodException e) { | ||
throw new RuntimeException(e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please throw LinkageError here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The better way to solve this would be to add the Java 21 code in some static-only class to wrap calls and add another sourceset for Java 22 where you reimplement the same class with Java 22+ code. Of course we need an extract of apijar for java 22 then.
The MR-JAR magic will automatically load the correct implementation based on runtime Java version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to add a jdk-22 extract of all APIs in core package. But then we should find a way to reuse the apijar files in sandbox and not have duplicates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do, thanks!
throw new LinkageError("FaissKnnVectorsFormat is missing from JAR file", e); | ||
} catch (IllegalAccessException | NoSuchMethodException e) { | ||
throw new LinkageError("FaissKnnVectorsFormat is missing correctly typed constructor", e); | ||
} catch (Throwable t) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check how it is done here:
lucene/lucene/core/src/java/org/apache/lucene/store/MMapDirectory.java
Lines 399 to 420 in e610d0e
private static <A> MMapIndexInputProvider<A> lookupProvider() { | |
final var maxPermits = getSharedArenaMaxPermitsSysprop(); | |
final var lookup = MethodHandles.lookup(); | |
try { | |
final var cls = lookup.findClass("org.apache.lucene.store.MemorySegmentIndexInputProvider"); | |
// we use method handles, so we do not need to deal with setAccessible as we have private | |
// access through the lookup: | |
final var constr = lookup.findConstructor(cls, MethodType.methodType(void.class, int.class)); | |
try { | |
return (MMapIndexInputProvider<A>) constr.invoke(maxPermits); | |
} catch (RuntimeException | Error e) { | |
throw e; | |
} catch (Throwable th) { | |
throw new AssertionError(th); | |
} | |
} catch (NoSuchMethodException | IllegalAccessException e) { | |
throw new LinkageError( | |
"MemorySegmentIndexInputProvider is missing correctly typed constructor", e); | |
} catch (ClassNotFoundException cnfe) { | |
throw new LinkageError("MemorySegmentIndexInputProvider is missing in Lucene JAR file", cnfe); | |
} | |
} |
As said before carefully differentiate between Exception types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the original code we have a separate try/ctach around the invoke call, this is unsafe as it may catch too many exceptions from unrelated places.
So please add a try-catch around only the call.
An please use a correctly typed method handle..... This is too lazy as not statically typed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wozuld only code the multi-arg ctor here statically and delegate from no-arg ctor of this class by calling this(DEFAULT.....)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in the refactor!
I do really like this idea. See related thread as an example struggle with just a one-line PR from yesterday: #14916 (comment) In the |
Well, getting there, getting there. ;) |
Let's start a discussion on dev list about releasing and new features like this. |
import org.apache.lucene.index.SegmentWriteState; | ||
|
||
/** | ||
* Provides a Faiss-based vector format, see corresponding classes in {@code java21/} for docs! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the docs of private classes in java21 are not visible in javadocs. So need to stay here.
Thanks a lot for the review @uschindler, it was super helpful! I've segregated all "unsafe" code outside Would appreciate a review! |
Description
Backport #14178 to 10.x
Summary of changes
libfaiss_c.so
shared library from tests, enforced by the security managerI also ran the Faiss tests offline with JDK 21, 22, 23, 24 using:
..and they all passed!
Diff of relevant files
https://gist.github.com/kaivalnp/77fb274ebd3de22c500396e80bb5df3c