Vectorize bitset to array #14910
Conversation
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.
This cannot be merged without adding this to the java24 part and removing the requires of the incubator module for JMH. I assume this is only meant for quick checks and stays a draft?
Thanks for the reminder! Yes, I will do that after the code is integrated.
I managed to get some luceneutil data on AVX512.

Some more data: Mac M2; AVX512 (mentioned above); the same AVX512 machine without
This is very cool and the speedup makes sense to me. When dynamic pruning is enabled, only queries whose leading clauses are dense benefit significantly from this speedup. Like for #14896, I'd like to split this PR in two: one where we merge your scalar improvements, and then this one where we add support for vectorization. By the way, we may want to look into other approaches for the scalar case. Since we only use bit sets in postings when many bits would be set, a linear scan should perform quite efficiently?
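The branchless linear scan suggested here could look roughly like the sketch below. This is an illustrative sketch only, not Lucene's actual implementation; the class name `WordScan` and the exact signature are assumptions. The idea is that on a dense word it is cheaper to write every candidate doc unconditionally and advance the write cursor by the bit value, avoiding data-dependent branches.

```java
public class WordScan {
  // Hypothetical branchless decode of one 64-bit word: every position is
  // written unconditionally, and the cursor only advances when the bit is
  // set, so there are no branches for the CPU to mispredict on dense words.
  // Note: the output array needs up to Long.SIZE entries of slack past the
  // last real doc, since slots beyond the final cursor get scratch writes.
  static int word2Array(long word, int base, int[] docs, int offset) {
    for (int i = 0; i < Long.SIZE; i++) {
      docs[offset] = base + i;              // unconditional write
      offset += (int) ((word >>> i) & 1L);  // advance only for set bits
    }
    return offset;
  }
}
```

A caller would pass a `docs` buffer sized with the required slack, which matches the over-allocation pattern visible later in this PR's assertions.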
JMH results with the vectorized implementations:
Thank you for updating the benchmark. I suggest we first figure out how to handle compress() on #14896 before coming back to this PR.
+1, I'm tracking this PR as well.
@@ -25,6 +25,7 @@
     requires org.apache.lucene.core;
     requires org.apache.lucene.expressions;
     requires org.apache.lucene.sandbox;
+    requires jdk.incubator.vector;
Please remove the code from the benchmark module and keep the benchmark code only in the java24 source set.
  }
}

// NOCOMMIT remove vectorized methods and requirement on vector module before merge.
Ah, yes this needs to go away. Our build system found this already. ❤️
@@ -39,6 +40,7 @@ int word2Array(long word, int base, int[] docs, int offset) {
    return intWord2Array((int) (word >>> 32), docs, offset, base + 32);
  }

@SuppressForbidden(reason = "Uses compress only where fast and carefully contained") |
The if check described in the forbidden-apis rule is missing.
The check happens before the instance is picked; must we also check it within this method?
Lines 94 to 99 in 2068441
if (Constants.HAS_FAST_COMPRESS_MASK_CAST
    && PanamaVectorConstants.PREFERRED_VECTOR_BITSIZE >= 256) {
  return PanamaBitSetUtil.INSTANCE;
} else {
  return BitSetUtil.INSTANCE;
}
Hi, then it is ok. Please add this info to the @SuppressWarnings message!
Maybe say: The whole BitsetUtil impl instance is only used when HAS_FAST_COMPRESS_MASK_CAST is enabled.
I just want to make sure that Robert does not complain.
Just my final comment: the setup of how the instances are picked looks fine, although I would have preferred an interface instead of subclassing. So BitSetUtil as a public (but restricted) interface, DefaultBitSetUtil as the non-vectorized implementation and PanamaBitSetUtil as another implementation, both package-private.
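The interface-based design suggested in this comment could be sketched as follows. The type names follow the comment; the method shape and the scalar fallback body are assumptions for illustration, not the PR's actual code (a PanamaBitSetUtil would implement the same interface in the java24 source set).

```java
// Hypothetical sketch: BitSetUtil as an interface with a package-private,
// non-vectorized default implementation, instead of subclassing.
interface BitSetUtil {
  int word2Array(long word, int base, int[] docs, int offset);
}

final class DefaultBitSetUtil implements BitSetUtil {
  static final BitSetUtil INSTANCE = new DefaultBitSetUtil();

  @Override
  public int word2Array(long word, int base, int[] docs, int offset) {
    // Classic scalar loop: peel off the lowest set bit until none remain.
    while (word != 0L) {
      docs[offset++] = base + Long.numberOfTrailingZeros(word);
      word &= word - 1L; // clear the lowest set bit
    }
    return offset;
  }
}
```

With this shape, the selection logic shown above would simply return one `BitSetUtil` instance or the other, and callers would not care which implementation they got.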
@@ -206,7 +208,8 @@ private static Optional<Module> lookupVectorModule() {
          "org.apache.lucene.util.VectorUtil",
          "org.apache.lucene.codecs.lucene103.Lucene103PostingsReader",
          "org.apache.lucene.codecs.lucene103.PostingIndexInput",
-         "org.apache.lucene.tests.util.TestSysoutsLimits");
+         "org.apache.lucene.tests.util.TestSysoutsLimits",
+         "org.apache.lucene.util.FixedBitSet");
FWIW we should try to avoid expanding this list. Hopefully, as per my other comment, we can move bitsetToArray to VectorUtil instead so that this list can stay as-is.
Fully agree.
@@ -114,6 +114,8 @@ public static VectorizationProvider getInstance() {
  /** Create a new {@link PostingDecodingUtil} for the given {@link IndexInput}. */
  public abstract PostingDecodingUtil newPostingDecodingUtil(IndexInput input) throws IOException;

+ public abstract BitSetUtil newBitSetUtil();
I would rather have the methods of BitSetUtil in this class. PostingDecodingUtil is a bit different because it requires state (the MemorySegment), which is not the case with BitSetUtil. Then we can call bitsetToArray from VectorUtil and don't need to add a new class that is allowed to call the vector API.
I agree with this. It is stateless, so let's reuse the already existing method. Then we can also have the "if Constant" part there. You can ignore my other note.
for (int i = 0; i < Integer.SIZE; i += INT_SPECIES.length()) {
  IntVector.fromArray(INT_SPECIES, IDENTITY, i)
      .add(base)
Should we broadcast base to an IntVector outside of the loop so that it's only done once rather than once per iteration?
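The suggested hoist could look like the sketch below. This is an illustrative sketch, not the PR's code: the class name `BroadcastHoist`, the method shape, and the use of `SPECIES_PREFERRED` are assumptions, and it requires running with `--add-modules jdk.incubator.vector`.

```java
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;

public class BroadcastHoist {
  static final VectorSpecies<Integer> INT_SPECIES = IntVector.SPECIES_PREFERRED;

  // Hypothetical sketch: broadcast `base` into a vector once, outside the
  // loop, instead of re-broadcasting it on every iteration via add(int).
  // Assumes identity.length is a multiple of INT_SPECIES.length().
  static void addBase(int[] identity, int base, int[] out) {
    IntVector baseVec = IntVector.broadcast(INT_SPECIES, base); // hoisted
    for (int i = 0; i < identity.length; i += INT_SPECIES.length()) {
      IntVector.fromArray(INT_SPECIES, identity, i)
          .add(baseVec)          // vector + vector, no per-iteration broadcast
          .intoArray(out, i);
    }
  }
}
```

In practice the JIT may already hoist the broadcast as loop-invariant code, so this change mainly makes the intent explicit.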
private static int intWord2Array(int word, int[] resultArray, int offset, int base) {
  IntVector bitMask = IntVector.fromArray(INT_SPECIES, IDENTITY_MASK, 0);

  for (int i = 0; i < Integer.SIZE; i += INT_SPECIES.length()) {
I'm a bit uncomfortable with the underlying assumption that INT_SPECIES.length() is a divisor of Integer.SIZE. Can we write the code in a way that doesn't make this assumption, or add a check somewhere?
@SuppressForbidden(reason = "Uses compress only where fast and carefully contained")
private static int intWord2Array(int word, int[] resultArray, int offset, int base) {
  IntVector bitMask = IntVector.fromArray(INT_SPECIES, IDENTITY_MASK, 0);
I wonder if this should be a private static final field?
IntVector.fromArray(INT_SPECIES, IDENTITY, i)
    .add(base)
    .compress(bitMask.and(word).compare(VectorOperators.NE, 0))
    .reinterpretAsInts()
Why is this necessary? Doesn't compress() already return an IntVector?
-     : "Array length must be at least bitSet.cardinality(from, to) + 1";
  public final int bitsetToArray(FixedBitSet bitSet, int from, int to, int base, int[] array) {
+   assert bitSet.cardinality(from, to) + 16 <= array.length
+       : "Array length must be at least bitSet.cardinality(from, to) + 16";

    Objects.checkFromToIndex(from, to, bitSet.length());
I wonder if we should refactor this bitsetToArray method to compute the bitCount of each word up-front to reduce dependencies between iterations of this loop (we rely on the result of wordToArray to know the next index at which to start writing data, so we can't start the next iteration of the loop before the current one is finished).
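The refactoring idea could be sketched like this. It is an illustrative sketch under assumed names (`PrefixDecode`, a bare `long[]` of words rather than a FixedBitSet slice), not the PR's code: a prefix sum of Long.bitCount gives every word its write offset up front, so the decode iterations no longer depend on one another and can in principle be pipelined or parallelized.

```java
public class PrefixDecode {
  // Hypothetical sketch: precompute per-word write offsets so each word can
  // be decoded independently of the previous iteration's result.
  static int bitsetToArray(long[] words, int base, int[] docs) {
    int[] offsets = new int[words.length + 1];
    for (int i = 0; i < words.length; i++) {
      offsets[i + 1] = offsets[i] + Long.bitCount(words[i]); // prefix sum
    }
    for (int i = 0; i < words.length; i++) { // iterations are now independent
      long word = words[i];
      int o = offsets[i]; // known before the previous word finishes decoding
      while (word != 0L) {
        docs[o++] = base + (i << 6) + Long.numberOfTrailingZeros(word);
        word &= word - 1L; // clear the lowest set bit
      }
    }
    return offsets[words.length];
  }
}
```

The trade-off is an extra pass over the words; whether that pays off depends on how well the hardware overlaps the now-independent decode iterations.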
By the way, maybe the micro benchmark should be updated to operate on a FixedBitSet instead of a single word, to better capture this sort of thing.
I understand the idea; I can try. I thought it could be challenging for the compiler to know there is no overlap between the writing ranges across iterations.
Here is the result:
BitsetToArrayBenchmark.hybrid 256 thrpt 5 5.239 ± 0.202 ops/us
BitsetToArrayBenchmark.hybrid 384 thrpt 5 6.077 ± 0.135 ops/us
BitsetToArrayBenchmark.hybrid 512 thrpt 5 6.259 ± 0.087 ops/us
BitsetToArrayBenchmark.hybrid 768 thrpt 5 5.926 ± 0.078 ops/us
BitsetToArrayBenchmark.hybrid 1024 thrpt 5 4.889 ± 0.056 ops/us
BitsetToArrayBenchmark.hybridUnrolling 256 thrpt 5 4.850 ± 0.054 ops/us
BitsetToArrayBenchmark.hybridUnrolling 384 thrpt 5 5.940 ± 0.086 ops/us
BitsetToArrayBenchmark.hybridUnrolling 512 thrpt 5 6.271 ± 0.098 ops/us
BitsetToArrayBenchmark.hybridUnrolling 768 thrpt 5 5.328 ± 0.106 ops/us
BitsetToArrayBenchmark.hybridUnrolling 1024 thrpt 5 4.174 ± 0.059 ops/us
In case you are interested, here is the full result of this new benchmark.
Benchmark (bitLength) Mode Cnt Score Error Units
BitsetToArrayBenchmark.dense 256 thrpt 5 2.152 ± 0.017 ops/us
BitsetToArrayBenchmark.dense 384 thrpt 5 1.308 ± 0.024 ops/us
BitsetToArrayBenchmark.dense 512 thrpt 5 1.156 ± 0.020 ops/us
BitsetToArrayBenchmark.dense 768 thrpt 5 0.991 ± 0.024 ops/us
BitsetToArrayBenchmark.dense 1024 thrpt 5 0.888 ± 0.020 ops/us
BitsetToArrayBenchmark.denseBranchLess 256 thrpt 5 5.646 ± 0.050 ops/us
BitsetToArrayBenchmark.denseBranchLess 384 thrpt 5 3.999 ± 0.044 ops/us
BitsetToArrayBenchmark.denseBranchLess 512 thrpt 5 3.097 ± 0.085 ops/us
BitsetToArrayBenchmark.denseBranchLess 768 thrpt 5 2.099 ± 0.017 ops/us
BitsetToArrayBenchmark.denseBranchLess 1024 thrpt 5 1.622 ± 0.020 ops/us
BitsetToArrayBenchmark.denseBranchLessCmov 256 thrpt 5 3.692 ± 0.032 ops/us
BitsetToArrayBenchmark.denseBranchLessCmov 384 thrpt 5 2.572 ± 0.033 ops/us
BitsetToArrayBenchmark.denseBranchLessCmov 512 thrpt 5 1.970 ± 0.018 ops/us
BitsetToArrayBenchmark.denseBranchLessCmov 768 thrpt 5 0.815 ± 0.015 ops/us
BitsetToArrayBenchmark.denseBranchLessCmov 1024 thrpt 5 0.728 ± 0.008 ops/us
BitsetToArrayBenchmark.denseBranchLessParallel 256 thrpt 5 5.803 ± 0.054 ops/us
BitsetToArrayBenchmark.denseBranchLessParallel 384 thrpt 5 4.106 ± 0.056 ops/us
BitsetToArrayBenchmark.denseBranchLessParallel 512 thrpt 5 3.202 ± 0.033 ops/us
BitsetToArrayBenchmark.denseBranchLessParallel 768 thrpt 5 2.181 ± 0.018 ops/us
BitsetToArrayBenchmark.denseBranchLessParallel 1024 thrpt 5 1.657 ± 0.019 ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling 256 thrpt 5 6.157 ± 0.104 ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling 384 thrpt 5 4.380 ± 0.042 ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling 512 thrpt 5 3.392 ± 0.060 ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling 768 thrpt 5 2.354 ± 0.023 ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling 1024 thrpt 5 1.794 ± 0.012 ops/us
BitsetToArrayBenchmark.denseInvert 256 thrpt 5 1.865 ± 0.034 ops/us
BitsetToArrayBenchmark.denseInvert 384 thrpt 5 1.791 ± 0.025 ops/us
BitsetToArrayBenchmark.denseInvert 512 thrpt 5 1.817 ± 0.018 ops/us
BitsetToArrayBenchmark.denseInvert 768 thrpt 5 1.904 ± 0.015 ops/us
BitsetToArrayBenchmark.denseInvert 1024 thrpt 5 1.816 ± 0.016 ops/us
BitsetToArrayBenchmark.forLoop 256 thrpt 5 5.645 ± 0.081 ops/us
BitsetToArrayBenchmark.forLoop 384 thrpt 5 6.118 ± 0.073 ops/us
BitsetToArrayBenchmark.forLoop 512 thrpt 5 6.352 ± 0.068 ops/us
BitsetToArrayBenchmark.forLoop 768 thrpt 5 5.957 ± 0.081 ops/us
BitsetToArrayBenchmark.forLoop 1024 thrpt 5 4.931 ± 0.088 ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling 256 thrpt 5 5.411 ± 0.156 ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling 384 thrpt 5 5.699 ± 0.238 ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling 512 thrpt 5 5.585 ± 0.161 ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling 768 thrpt 5 4.667 ± 0.101 ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling 1024 thrpt 5 3.999 ± 0.037 ops/us
BitsetToArrayBenchmark.hybrid 256 thrpt 5 5.239 ± 0.202 ops/us
BitsetToArrayBenchmark.hybrid 384 thrpt 5 6.077 ± 0.135 ops/us
BitsetToArrayBenchmark.hybrid 512 thrpt 5 6.259 ± 0.087 ops/us
BitsetToArrayBenchmark.hybrid 768 thrpt 5 5.926 ± 0.078 ops/us
BitsetToArrayBenchmark.hybrid 1024 thrpt 5 4.889 ± 0.056 ops/us
BitsetToArrayBenchmark.hybridUnrolling 256 thrpt 5 4.850 ± 0.054 ops/us
BitsetToArrayBenchmark.hybridUnrolling 384 thrpt 5 5.940 ± 0.086 ops/us
BitsetToArrayBenchmark.hybridUnrolling 512 thrpt 5 6.271 ± 0.098 ops/us
BitsetToArrayBenchmark.hybridUnrolling 768 thrpt 5 5.328 ± 0.106 ops/us
BitsetToArrayBenchmark.hybridUnrolling 1024 thrpt 5 4.174 ± 0.059 ops/us
BitsetToArrayBenchmark.whileLoop 256 thrpt 5 3.932 ± 0.021 ops/us
BitsetToArrayBenchmark.whileLoop 384 thrpt 5 3.779 ± 0.054 ops/us
BitsetToArrayBenchmark.whileLoop 512 thrpt 5 3.889 ± 0.049 ops/us
BitsetToArrayBenchmark.whileLoop 768 thrpt 5 3.747 ± 0.083 ops/us
BitsetToArrayBenchmark.whileLoop 1024 thrpt 5 3.856 ± 0.088 ops/us
Also it is interesting to see that forLoop is much faster than whileLoop, which matches my luceneutil result, while the previous benchmark did not show this.
In addition, here are the vectorized method benchmarks:
Benchmark (bitLength) Mode Cnt Score Error Units
BitsetToArrayBenchmark.denseBranchLessVectorized 256 thrpt 5 17.648 ± 0.059 ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized 384 thrpt 5 14.186 ± 0.049 ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized 512 thrpt 5 12.102 ± 0.203 ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized 768 thrpt 5 9.585 ± 0.068 ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized 1024 thrpt 5 7.433 ± 0.093 ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedAVX2 256 thrpt 5 10.885 ± 0.069 ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedAVX2 384 thrpt 5 7.534 ± 0.280 ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedAVX2 512 thrpt 5 6.894 ± 0.019 ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedAVX2 768 thrpt 5 5.068 ± 0.014 ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedAVX2 1024 thrpt 5 3.862 ± 0.109 ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedFromLong 256 thrpt 5 18.918 ± 0.168 ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedFromLong 384 thrpt 5 18.055 ± 0.093 ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedFromLong 512 thrpt 5 17.804 ± 0.079 ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedFromLong 768 thrpt 5 16.803 ± 0.051 ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedFromLong 1024 thrpt 5 17.066 ± 0.042 ops/us
FWIW I benchmarked this change on my machine (AVX2); there may be a small speedup for some queries but not as much as previously reported.
See Adrien's comment.
I guess your benchmark runs against current
According to this report, this optimization only has a noticeable improvement on AVX512, so now I am actually a bit hesitant to move on.
Much better now. I am still not sure if this is all worth the complexity!
This is a minimal proof of concept describing an idea of how to vectorize reading a bitset into an array, which can be a hot path when a posting list is encoded as a bitset. This version currently only runs on AVX512, but can be adapted to more platforms in the future.