fix: Fix Presto Java UUID serialization #11197
Please review @aditi-pandit @Yuhta @mbasmanova
Thanks @Yuhta , but |
@BryanCutler You can add a utility to |
Status: I've added |
I have updated this PR to work with the latest changes from prestodb/presto#23847. The serialization makes sure Presto Java receives values in the format described in prestodb/presto#23961 (comment). I have manually tested these changes with Presto Java, following the error description from prestodb/presto#23311, and confirmed the values are now displayed correctly. @Yuhta please review again, thanks.
Thanks @BryanCutler
It would be great to enhance the VectorFuzzer (https://github.com/facebookincubator/velox/blob/main/velox/vector/fuzzer/VectorFuzzer.h) to build vectors of random UUID values, and to use it in the fuzzers so that this type and its functions are exercised in different queries.
https://facebookincubator.github.io/velox/develop/testing.html
Some of these fuzzers have also been enhanced to do validation of queries with Presto. That would be a good exercise to ensure the serialization works e2e.
Please consider adding that support as well.
Wei, Krishna: To give some background, we found these correctness issues and gaps when running some e2e queries using the UUID type (prestodb/presto#23311). It would be great to enhance the fuzzers to catch issues with this type.
for (int row = 0; row < uuidValues.size(); row++) {
  uuidValues[row] = (int128_t) 0xD1 << row % 120;
}
auto vector = makeFlatVector<int128_t>(uuidValues, UUID());
Use variation https://github.com/facebookincubator/velox/blob/main/velox/vector/tests/utils/VectorTestBase.h#L183 of makeFlatVector to avoid the for loops above.
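A minimal sketch of the generator-style pattern this comment suggests, written without Velox so it stands alone. `makeValues` and `makeUuidTestValues` are hypothetical helpers invented for illustration, not Velox APIs, and `int128_t` is assumed to be the GCC/Clang `__int128` extension. The idea is that the caller passes a per-row function instead of filling a vector in a loop first.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Assumption: GCC/Clang 128-bit integer extension, standing in for Velox's int128_t.
using int128_t = __int128;

// Hypothetical stand-in for a generator-style vector builder:
// produces `size` values by calling valueAt(row) for each row.
std::vector<int128_t> makeValues(
    size_t size,
    const std::function<int128_t(size_t)>& valueAt) {
  assert(valueAt && "valueAt must be callable");
  std::vector<int128_t> values;
  values.reserve(size);
  for (size_t row = 0; row < size; ++row) {
    values.push_back(valueAt(row));
  }
  return values;
}

// Same value pattern as the test snippet above:
// 0xD1 shifted left by (row % 120) bits.
std::vector<int128_t> makeUuidTestValues(size_t size) {
  return makeValues(size, [](size_t row) {
    return static_cast<int128_t>(0xD1) << (row % 120);
  });
}
```

With a helper of this shape, the explicit loop over `uuidValues` in the test is no longer needed; the value pattern lives in the lambda.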
@@ -127,6 +128,9 @@ class UuidCastOperator : public exec::CastOperator {
  int128_t u;
  memcpy(&u, &uuid, 16);

  // Value is big endian from boost, store as native byte-order
Nit : Comments should be complete sentences and end in a full-stop.
@@ -75,7 +75,8 @@ class UuidCastOperator : public exec::CastOperator {
  const auto* uuids = input.as<SimpleVector<int128_t>>();

  context.applyToSelectedNoThrow(rows, [&](auto row) {
    const auto uuid = uuids->valueAt(row);
    // Make sure UUID bytes are big endian when building string
Nit : Comments should be complete sentences and end in a full-stop.
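To illustrate what "big endian when building string" means here, the following is a standalone sketch, not the code from this PR. It assumes the UUID is held as a native 128-bit integer whose most significant byte is the first byte of the canonical text form; `uuidToString` is a hypothetical name, and `int128_t` is assumed to be the GCC/Clang `__int128` extension. Extracting bytes by shifting avoids any dependence on the host's memory byte order.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Assumption: GCC/Clang 128-bit integer extension, standing in for Velox's int128_t.
using int128_t = __int128;

// Hypothetical helper: formats a 128-bit value as a canonical 8-4-4-4-12 UUID
// string, emitting the most significant byte first (big-endian byte order).
std::string uuidToString(int128_t value) {
  static constexpr char kHex[] = "0123456789abcdef";
  const auto bits = static_cast<unsigned __int128>(value);
  std::string out;
  out.reserve(36);
  for (int i = 0; i < 16; ++i) {
    // i == 0 selects the most significant byte of the 128-bit value.
    const auto byte = static_cast<uint8_t>(bits >> (8 * (15 - i)));
    out.push_back(kHex[byte >> 4]);
    out.push_back(kHex[byte & 0x0F]);
    // Dashes follow bytes 3, 5, 7 and 9 in the canonical layout.
    if (i == 3 || i == 5 || i == 7 || i == 9) {
      out.push_back('-');
    }
  }
  assert(out.size() == 36);
  return out;
}
```

Under this assumption, the value `0xD1` renders with its only non-zero byte at the end of the string.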
@@ -443,6 +445,42 @@ void readDecimalValues(
  }
}

int128_t readUuidValue(ByteInputStream* source) {
  // ByteInputStream does not support reading int128_t values.
  // UUIDs are serialized as 2 int64 values with msb int64 value first.
Nit : uint64 (not int64)
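A rough sketch of the reading scheme described in that code comment: two unsigned 64-bit words, most significant word first, combined into one 128-bit value. `ByteReader` and `readUuidSketch` are hypothetical stand-ins (the real code reads from `ByteInputStream`), and the byte order within each 64-bit word is an assumption of this sketch that the thread does not state. Using unsigned words avoids sign extension when the low word has its top bit set, which is presumably the point of the uint64 nit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Assumption: GCC/Clang 128-bit integer extension, standing in for Velox's int128_t.
using int128_t = __int128;

// Hypothetical minimal byte source; not the Velox ByteInputStream API.
class ByteReader {
 public:
  explicit ByteReader(std::vector<uint8_t> bytes) : bytes_(std::move(bytes)) {}

  // Reads one 64-bit word, least significant byte first
  // (the within-word byte order is an assumption of this sketch).
  uint64_t readUint64() {
    if (bytes_.size() - pos_ < 8) {
      throw std::out_of_range("not enough bytes");
    }
    uint64_t value = 0;
    for (int i = 0; i < 8; ++i) {
      value |= static_cast<uint64_t>(bytes_[pos_ + i]) << (8 * i);
    }
    pos_ += 8;
    return value;
  }

 private:
  std::vector<uint8_t> bytes_;
  size_t pos_ = 0;
};

// The first word read holds the most significant 64 bits.
int128_t readUuidSketch(ByteReader& source) {
  const uint64_t high = source.readUint64();
  const uint64_t low = source.readUint64();
  return static_cast<int128_t>(
      (static_cast<unsigned __int128>(high) << 64) | low);
}

// Usage: high word 1 followed by low word 2 yields (1 << 64) | 2.
inline void exampleReadUuid() {
  std::vector<uint8_t> bytes = {1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0};
  ByteReader reader(bytes);
  assert(readUuidSketch(reader) == ((static_cast<int128_t>(1) << 64) | 2));
}
```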
Thanks for reviewing @aditi-pandit. What do you think about me working on additional fuzzer tests as a follow-up if the rest of this PR is ready? I did add a serialization round-trip test that goes through a full range of values, but I do think improving the fuzzer to use UUIDs sounds like a good addition.
That's fine @BryanCutler. Thanks!
Thanks @BryanCutler
@Yuhta please review, thanks
@Yuhta please review, thanks!
I missed a format fix, could you try again @Yuhta?
@kagamiori has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@@ -16,6 +16,7 @@
#include "velox/serializers/PrestoSerializer.h"
#include <boost/random/uniform_int_distribution.hpp>
#include <folly/Random.h>
#include <functions/prestosql/types/UuidType.h>
Hi @BryanCutler
nit: Is this header not needed? If so, could you remove it? Thanks!
Thanks, good catch. No, it's not needed anymore; I've removed it.
Summary: This fixes the PrestoSerializer to put UUID values in the format expected by Presto Java, so that the values match those from a Java worker. First, when converting a UUID to or from a string, the values are no longer kept in big-endian format (as taken from boost::uuid) and are instead stored in little-endian order in an int128_t. Second, Presto Java reads UUID values from an Int128ArrayBlock with the first value as the most significant bits. To correct this, the upper and lower parts of the int128_t are swapped during serialization and deserialization. A unit test checking round-trip UUID serialization was added, and manual testing of Presto with a native worker verified that the problem from the issue description is fixed. From prestodb/presto#23311
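The upper/lower swap in the summary can be sketched as a standalone round trip. This is not the PrestoSerializer code: `serializeUuid`, `deserializeUuid` and `roundTrips` are hypothetical names, `int128_t` is assumed to be the GCC/Clang `__int128` extension, and the little-endian layout of each 64-bit word is an assumption of the sketch. On a little-endian host a raw copy of an int128_t would put the low 64 bits first, so writing the high word first is the swap being described.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: GCC/Clang 128-bit integer extensions, standing in for Velox's int128_t.
using int128_t = __int128;
using uint128_t = unsigned __int128;

// Appends one 64-bit word, least significant byte first (sketch assumption).
void appendUint64(std::vector<uint8_t>& out, uint64_t value) {
  for (int i = 0; i < 8; ++i) {
    out.push_back(static_cast<uint8_t>(value >> (8 * i)));
  }
}

// Loads one 64-bit word written by appendUint64.
uint64_t loadUint64(const std::vector<uint8_t>& in, size_t offset) {
  uint64_t value = 0;
  for (int i = 0; i < 8; ++i) {
    value |= static_cast<uint64_t>(in[offset + i]) << (8 * i);
  }
  return value;
}

// Writes the upper 64 bits first, then the lower 64 bits.
void serializeUuid(int128_t value, std::vector<uint8_t>& out) {
  const auto bits = static_cast<uint128_t>(value);
  appendUint64(out, static_cast<uint64_t>(bits >> 64));
  appendUint64(out, static_cast<uint64_t>(bits));
}

// Reads the words back in the same order and rebuilds the 128-bit value.
int128_t deserializeUuid(const std::vector<uint8_t>& in, size_t offset) {
  assert(offset + 16 <= in.size());
  const uint64_t high = loadUint64(in, offset);
  const uint64_t low = loadUint64(in, offset + 8);
  return static_cast<int128_t>((static_cast<uint128_t>(high) << 64) | low);
}

// Round-trips the same value pattern as the unit test snippet in this thread.
bool roundTrips(size_t numRows) {
  for (size_t row = 0; row < numRows; ++row) {
    const int128_t value = static_cast<int128_t>(0xD1) << (row % 120);
    std::vector<uint8_t> buffer;
    serializeUuid(value, buffer);
    if (deserializeUuid(buffer, 0) != value) {
      return false;
    }
  }
  return true;
}
```

A round trip alone cannot detect a wrong word order, since the reader and writer would share the same mistake; that is why the manual check against Presto Java mentioned in the summary matters.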
@kagamiori merged this pull request in fe4f5a7.
Conbench analyzed the 1 benchmark run on commit. There were no benchmark performance regressions. 🎉 The full Conbench report has more details.