Skip to content

Backport Faiss-based vector format to 10.x #14843

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jul 9, 2025

Conversation

kaivalnp
Copy link
Contributor

Description

Backport #14178 to 10.x

Summary of changes

  • Support for JDK21 via MR-JAR and runtime lookup (similar to other Panama-based classes here). This was tricky because some functions were renamed from JDK21 (https://openjdk.org/jeps/442) -> JDK22 (https://openjdk.org/jeps/454, where it was finalized) -- I'm using method handles to bind to the correct function at runtime
  • Allow native access to the libfaiss_c.so shared library from tests, enforced by the security manager
  • Change GH action to run on JDK21
  • Misc changes for the lower Java language level + 10.x branch

I also ran the Faiss tests offline with JDK 21, 22, 23, 24 using:

gradlew -p lucene/sandbox -Dtests.faiss.run=true test --tests "org.apache.lucene.sandbox.codecs.faiss.*"

..and they all passed!

Diff of relevant files

https://gist.github.com/kaivalnp/77fb274ebd3de22c500396e80bb5df3c

Kaival Parikh added 2 commits July 5, 2025 20:02
@kaivalnp
Copy link
Contributor Author

kaivalnp commented Jul 5, 2025

Also included #14847 in this PR

@mikemccand
Copy link
Member

Did you try running the new monster Faiss test too @kaivalnp?

@kaivalnp
Copy link
Contributor Author

kaivalnp commented Jul 6, 2025

@mikemccand it succeeded a few times, but I ran into a failure after multiple runs:

Command (I'm on JDK21):

gradlew :lucene:sandbox:test --tests "org.apache.lucene.sandbox.codecs.faiss.TestFaissKnnVectorsFormat.testLargeVectorData" -Ptests.heapsize=16g -Ptests.jvms=12 "-Ptests.jvmargs=-XX:TieredStopAtLevel=1 -XX:+UseParallelGC -XX:ActiveProcessorCount=1" -Ptests.seed=6FE1439CE22DB5DA -Ptests.useSecurityManager=true -Ptests.monster=true -Ptests.gui=false -Ptests.file.encoding=UTF-8 -Ptests.vectorsize=256 -Ptests.forceintegervectors=true -Ptests.faiss.run=true

Error:

TestFaissKnnVectorsFormat > testLargeVectorData FAILED
    org.apache.lucene.index.CorruptIndexException: misplaced codec footer (file extended?): remaining=2147483666, expected=16, fp=78 (resource=BufferedChecksumIndexInput(MockIndexInputWrapper((clone of) ByteBuffersIndexInput (file=_0_FaissKnnVectorsFormat_0.vec, buffers=2,147,483,744 bytes, block size: 33,554,432, blocks: 65, position: 0))))
        at __randomizedtesting.SeedInfo.seed([6FE1439CE22DB5DA:1BB7D7D4A612D229]:0)
        at app/[email protected]/org.apache.lucene.codecs.CodecUtil.validateFooter(CodecUtil.java:579)
        at app/[email protected]/org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:433)
        at app/[email protected]/org.apache.lucene.codecs.lucene90.Lucene90CompoundFormat.writeCompoundFile(Lucene90CompoundFormat.java:145)
        at app/[email protected]/org.apache.lucene.codecs.lucene90.Lucene90CompoundFormat.write(Lucene90CompoundFormat.java:99)
        at app/[email protected]/org.apache.lucene.index.IndexWriter.createCompoundFile(IndexWriter.java:5900)

The codec footer check seems to fail on the .vec file related to Lucene99FlatVectorsFormat (which we use to store raw vectors outside Faiss) -- to verify this I replaced the format here to use only the flat format (and not the new Faiss format):

KnnVectorsFormat format =
        new Lucene99FlatVectorsFormat(FlatVectorScorerUtil.getLucene99FlatVectorsScorer());

and the test still fails, so it seems unrelated / pre-existing?

FYI I see the same failure on main after multiple runs..

@mikemccand
Copy link
Member

FYI I see the same failure on main after multiple runs..

Egads, OK, this does look like a pre-existing issue. Can you open a spinoff issue so we can get to the bottom of that? If Lucene99FlatVectorsFormat can somehow write corrupt index files when the vectors total size is > 2.1 B, we need to (separately) fix that! I'm glad you wrote this Monster test ... it seems like it uncovered a lurker bug.

I'll try to review this PR soon! Thank you for working out the backport.

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great @kaivalnp. It looks like a rote backport, except for the sandbox mrjar addition. Since main is happy with the Faiss codec, and it's baked ~3 weeks now, I'll merge this in a day or so. (It is Prime Day over here, kinda busy for Amazon Search team heh).

@mikemccand mikemccand merged commit d7f92bc into apache:branch_10x Jul 9, 2025
6 checks passed
@uschindler
Copy link
Contributor

uschindler commented Jul 9, 2025

I am not fully happy with this, so I would have given a -1 for the merge. Now it is too late! I am very busy at moment so I wasn't aber to review the code earlier!

IMHO, we should have waited and better release main branch with this as Lucene 11. I'd suggest (see my talk @ buzzwords) to release Lucene 11 a few months after the release of Java 25 (e.g., November 2025). We have good features like this already.

There are also some code issues (like catch Throwable and not rethrowing RuntimeException|Error unmodified and instead all is wrapped in unspecifid RuntimeException.... - this is a no-go in Lucene code and also discouraged by security researchers). Please add followup PRs and cleanup the exception handling. If you catch "Throwable" alsways add a catch (RuntimeException|Error e) { throw e; }.

In addition, all code triggering the restricted warning should be in a single class (@rmuir already complained with TODOs).

So my personal opinion: Revert revert revert.

configure(project(":lucene:core")) {
configure([
project(":lucene:core"),
project(":lucene:sandbox"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unfortunately this causes also the vector API be replicated in the sandbox module.... I don't like that. It also adds useless JAR files again to the code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, I'll try to reuse the apijar from lucene.core

*
* @lucene.experimental
*/
public class FaissKnnVectorsFormatProvider extends KnnVectorsFormat {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this class should be named FaissKnnVectorsFormat. It is not a provider pattern. In MMapDir this is different, there it is a provider for IndexInputs.

Instead the FaissKnnVectorsFormat in the java21 part should be package private (it is now public) and have a different name, e.g. 'FaissKnnVectorsFormatImpl'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, I've addressed this in the refactor on main


private static KnnVectorsFormat lookup(Object... args) {
try {
MethodHandles.Lookup lookup = MethodHandles.lookup();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have a private lookup here, so we can lookup a package private impl.

*
* @lucene.experimental
*/
public final class FaissKnnVectorsFormat extends KnnVectorsFormat {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make package-private

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+rename (see above)

# See the License for the specific language governing permissions and
# limitations under the License.

org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormatProvider
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename (see above)

@mikemccand
Copy link
Member

I am not fully happy with this, so I would have given a -1 for the merge. Now it is too late! I am very busy at moment so I wasn't aber to review the code earlier!

Oh, sorry, thanks @uschindler -- I will simply revert and we can iterate some more! This is good feedback, and also applies to the main version.

But this is all in sandbox :) It's OK to be a little ... sandy!

I will revert shortly.

@mikemccand
Copy link
Member

OK I reverted -- let's iterate some more -- @kaivalnp can you open a followon PR maybe? Thank you! Maybe start with @uschindler's feedback on main (Lucene 11.x)? Then we can revisit backport?

Copy link
Contributor

@uschindler uschindler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some more comments

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think we should not duplicate that file.

Either refer to it from the other module (seems most non-invasive) and especially compatible with main branch or move the code that accesses foreign APIs to main branch. In ideal world only the LibFaissC should access native APIs and add abstractions for all code to get rid of MemorySegment. Then we could place a provider to get the LibFaissC fclass with tha API abstractions from the core's org.apache.lucene.internal package (which is only exported to sandbox but not to public) and the KNN vector codec would reside in the sandbox package and only call methods which use public Java classes.

I would prefer also on main branch to put all restricted code in LibFaissC only.

This would also solve @rmuir's complaints about duoing "restricted" chnages to MemorySegments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only the LibFaissC should access native APIs and add abstractions for all code to get rid of MemorySegment

also solve @rmuir's complaints about duoing "restricted" chnages to MemorySegments

Makes sense, I've addressed these in the refactor!

try {
return (T) handle.invokeWithArguments(args);
} catch (Throwable e) {
throw new RuntimeException(e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The MethodHandle is to native code so in general it can only throw errors regarding type addaption. So please change this to:

   } catch (Error|RuntimeException e) {
     throw e;
   } catch (Throwable e) {
      throw new AssertionError("Should not throw checked exceptions", e);
   }

Please keep in mind that this is the slowest way to call a methodhandle. If those "calls" are done in loops affectimg performance, you should really really use invokeExact with exact types. For that you have to repeat the error handling code over and over, but it is better than the current code.

Have you looked at jextract which can generate a stub class from a header file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jextract which can generate a stub class from a header file?

Yes, this would have been ideal! https://github.com/openjdk/jextract

But the C API of Faiss makes use of forward declarations which aren't supported:

__stddef_max_align_t.h:22:15: warning: Skipping __clang_max_align_nonce2 (type LongDouble is not supported)
Index_c.h:21:1: warning: Skipping FaissRangeSearchResult_H (type Declared(FaissRangeSearchResult_H) is not supported)
Index_c.h:21:1: warning: Skipping FaissRangeSearchResult (type Declared(FaissRangeSearchResult_H) is not supported)

(output when I tried to generate classes for some C headers)

allocateArrayFunctionName,
MethodType.methodType(MemorySegment.class, MemoryLayout.class, long.class));
} catch (IllegalAccessException | NoSuchMethodException e) {
throw new RuntimeException(e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please throw LinkageError here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The better way to solve this would be to add the Java 21 code in some static-only class to wrap calls and add another sourceset for Java 22 where you reimplement the same class with Java 22+ code. Of course we need an extract of apijar for java 22 then.

The MR-JAR magic will automatically load the correct implementation based on runtime Java version.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to add a jdk-22 extract of all APIs in core package. But then we should find a way to reuse the apijar files in sandbox and not have duplicates.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do, thanks!

throw new LinkageError("FaissKnnVectorsFormat is missing from JAR file", e);
} catch (IllegalAccessException | NoSuchMethodException e) {
throw new LinkageError("FaissKnnVectorsFormat is missing correctly typed constructor", e);
} catch (Throwable t) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check how it is done here:

private static <A> MMapIndexInputProvider<A> lookupProvider() {
final var maxPermits = getSharedArenaMaxPermitsSysprop();
final var lookup = MethodHandles.lookup();
try {
final var cls = lookup.findClass("org.apache.lucene.store.MemorySegmentIndexInputProvider");
// we use method handles, so we do not need to deal with setAccessible as we have private
// access through the lookup:
final var constr = lookup.findConstructor(cls, MethodType.methodType(void.class, int.class));
try {
return (MMapIndexInputProvider<A>) constr.invoke(maxPermits);
} catch (RuntimeException | Error e) {
throw e;
} catch (Throwable th) {
throw new AssertionError(th);
}
} catch (NoSuchMethodException | IllegalAccessException e) {
throw new LinkageError(
"MemorySegmentIndexInputProvider is missing correctly typed constructor", e);
} catch (ClassNotFoundException cnfe) {
throw new LinkageError("MemorySegmentIndexInputProvider is missing in Lucene JAR file", cnfe);
}
}

As said before carefully differentiate between Exception types.

Copy link
Contributor

@uschindler uschindler Jul 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the original code we have a separate try/ctach around the invoke call, this is unsafe as it may catch too many exceptions from unrelated places.

So please add a try-catch around only the call.

An please use a correctly typed method handle..... This is too lazy as not statically typed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wozuld only code the multi-arg ctor here statically and delegate from no-arg ctor of this class by calling this(DEFAULT.....)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in the refactor!

@rmuir
Copy link
Member

rmuir commented Jul 9, 2025

IMHO, we should have waited and better release main branch with this as Lucene 11. I'd suggest (see my talk @ buzzwords) to release Lucene 11 a few months after the release of Java 25 (e.g., November 2025). We have good features like this already.

I do really like this idea. See related thread as an example struggle with just a one-line PR from yesterday: #14916 (comment)

In the main branch, the build system has been completely re-worked, and there is ongoing work to make it faster AND pickier: along with a lot of associated file cleanups.

@dweiss
Copy link
Contributor

dweiss commented Jul 9, 2025

In the main branch, the build system has been completely re-worked,

Well, getting there, getting there. ;)

@uschindler
Copy link
Contributor

Let's start a discussion on dev list about releasing and new features like this.

import org.apache.lucene.index.SegmentWriteState;

/**
* Provides a Faiss-based vector format, see corresponding classes in {@code java21/} for docs!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the docs of private classes in java21 are not visible in javadocs. So need to stay here.

@kaivalnp
Copy link
Contributor Author

Thanks a lot for the review @uschindler, it was super helpful!
I've taken an initial pass at refactoring some classes in main to make this backport easier like you mentioned (#14934)

I've segregated all "unsafe" code outside FaissKnnVectors{Format/Reader/Writer}, which is also loaded dynamically to make backport easier (we can directly move those files to java21/ on 10.x)

Would appreciate a review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants