Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Fix Presto Java UUID serialization #11197

Conversation

BryanCutler
Copy link
Contributor

This fixes the PrestoSerializer to put UUID values in the correct format that is expected by Presto Java so that the values will match those from a Java worker. First, when converting UUID to/from string, the values are no longer in big endian format (as taken from boost::uuid) and are instead stored as a little endian in an int128_t. Secondly, Presto Java will read UUID values from an Int128ArrayBlock with the first value as the most significant bits. To correct this, the upper/lower parts of the int128_t are swapped during serialization/deserialization.

A unit test for checking roundtrip UUID serializaiton was added and manual testing of Presto with a native worker to verify the problem from the issue description is fixed.

From prestodb/presto#23311

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 8, 2024
Copy link

netlify bot commented Oct 8, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit bf000b0
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/673d76747b8ff600082c98c2

@BryanCutler
Copy link
Contributor Author

please review @aditi-pandit @Yuhta @mbasmanova

velox/functions/prestosql/tests/UuidFunctionsTest.cpp Outdated Show resolved Hide resolved
velox/functions/prestosql/types/UuidType.cpp Outdated Show resolved Hide resolved
velox/functions/prestosql/types/UuidType.cpp Outdated Show resolved Hide resolved
@BryanCutler
Copy link
Contributor Author

Thanks @Yuhta , but folly::Endian::big is not defined for int128 types. It's based off of compiler built-in bytes swapping functions, so it looks like it could be added for GCC variants, but there is no built-in function for MSVC. I could add a check to these changes to only swap if the host system is LE, wdyt?

@Yuhta
Copy link
Contributor

Yuhta commented Oct 16, 2024

@BryanCutler You can add a utility to DecimalUtil to do that. We can do it with 2 folly::Endian::bigs if needed.

@BryanCutler BryanCutler force-pushed the fix-uuid-java-serde-23311 branch 2 times, most recently from beb2740 to d608cfe Compare October 24, 2024 01:27
@BryanCutler BryanCutler force-pushed the fix-uuid-java-serde-23311 branch 2 times, most recently from adb9e4a to 2ff3b4f Compare November 1, 2024 19:05
@BryanCutler
Copy link
Contributor Author

Status: I've added DecimalUtil::big() to swap bytes for int128_t, but there is an active PR at prestodb/presto#23847 which changes UUID format on the Presto Java side. There would need to be an additional adjustment to the changes here to compensate if the Java UUID PR is merged.

@BryanCutler
Copy link
Contributor Author

I have updated to work with the latest changes from prestodb/presto#23847. The serialization makes sure Presto Java receives values according to the format described in prestodb/presto#23961 (comment)

I have manually tested these changes with Presto Java following the error description from prestodb/presto#23311 and confirmed the values are now displayed correctly.

@Yuhta please review again, thanks.

@BryanCutler BryanCutler requested a review from Yuhta November 6, 2024 22:31
Copy link
Collaborator

@aditi-pandit aditi-pandit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @BryanCutler

Would be great to enhance the VectorFuzzer (https://github.com/facebookincubator/velox/blob/main/velox/vector/fuzzer/VectorFuzzer.h) to build vectors of random UUID values and use it in the fuzzers to incorporate this type and functions in different queries

https://facebookincubator.github.io/velox/develop/testing.html

Some of these fuzzers have also been enhanced to do validation of queries with Presto. That would be a good exercise to ensure the serialization works e2e.

Please consider adding that support as well.

@kagamiori @kgpai

Wei, Krishna : To give some background, we found these correctness issues and gaps when running some e2e queries using UUID type prestodb/presto#23311. Would be great to enhance the fuzzers to catch the issues with this type.

for (int row = 0; row < uuidValues.size(); row++) {
uuidValues[row] = (int128_t) 0xD1 << row % 120;
}
auto vector = makeFlatVector<int128_t>(uuidValues, UUID());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

velox/type/DecimalUtil.cpp Outdated Show resolved Hide resolved
@@ -127,6 +128,9 @@ class UuidCastOperator : public exec::CastOperator {
int128_t u;
memcpy(&u, &uuid, 16);

// Value is big endian from boost, store as native byte-order
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit : Comments should be complete sentences and end in a full-stop.

@@ -75,7 +75,8 @@ class UuidCastOperator : public exec::CastOperator {
const auto* uuids = input.as<SimpleVector<int128_t>>();

context.applyToSelectedNoThrow(rows, [&](auto row) {
const auto uuid = uuids->valueAt(row);
// Make sure UUID bytes are big endian when building string
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit : Comments should be complete sentences and end in a full-stop.

@@ -443,6 +445,42 @@ void readDecimalValues(
}
}

int128_t readUuidValue(ByteInputStream* source) {
// ByteInputStream does not support reading int128_t values.
// UUIDs are serialized as 2 int64 values with msb int64 value first.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit : uint64 (not int64)

velox/serializers/PrestoSerializer.cpp Outdated Show resolved Hide resolved
velox/serializers/PrestoSerializer.cpp Outdated Show resolved Hide resolved
@BryanCutler BryanCutler force-pushed the fix-uuid-java-serde-23311 branch from 65e09f1 to fad021d Compare November 13, 2024 01:21
@BryanCutler
Copy link
Contributor Author

Thanks for reviewing @aditi-pandit . What do you think about me working on additional fuzzer tests as a followup if the rest of this PR is ready? I did add a serialization round trip test that goes through a full range of values, but I do think improving the fuzzer to use UUIDs sounds like a good addition.

@aditi-pandit
Copy link
Collaborator

Thanks for reviewing @aditi-pandit . What do you think about me working on additional fuzzer tests as a followup if the rest of this PR is ready? I did add a serialization round trip test that goes through a full range of values, but I do think improving the fuzzer to use UUIDs sounds like a good addition.

Thats fine @BryanCutler. Thanks !

Copy link
Collaborator

@aditi-pandit aditi-pandit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @BryanCutler

@BryanCutler
Copy link
Contributor Author

@Yuhta please review, thanks

@BryanCutler BryanCutler force-pushed the fix-uuid-java-serde-23311 branch 2 times, most recently from a51aec9 to da0e6cb Compare November 13, 2024 20:03
velox/type/DecimalUtil.h Outdated Show resolved Hide resolved
@BryanCutler BryanCutler force-pushed the fix-uuid-java-serde-23311 branch from da0e6cb to d2ebf7f Compare November 14, 2024 05:55
velox/type/DecimalUtil.h Outdated Show resolved Hide resolved
@BryanCutler BryanCutler force-pushed the fix-uuid-java-serde-23311 branch from d2ebf7f to 958efe6 Compare November 14, 2024 18:17
@BryanCutler
Copy link
Contributor Author

@Yuhta please review, thanks!

@BryanCutler BryanCutler force-pushed the fix-uuid-java-serde-23311 branch 2 times, most recently from d23db6f to 3c8c85d Compare November 19, 2024 20:28
@Yuhta Yuhta added the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label Nov 19, 2024
@BryanCutler BryanCutler force-pushed the fix-uuid-java-serde-23311 branch from 3c8c85d to 7b7123b Compare November 19, 2024 21:47
@BryanCutler
Copy link
Contributor Author

I missed a format fix, could you try again @Yuhta ?

@facebook-github-bot
Copy link
Contributor

@kagamiori has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@@ -16,6 +16,7 @@
#include "velox/serializers/PrestoSerializer.h"
#include <boost/random/uniform_int_distribution.hpp>
#include <folly/Random.h>
#include <functions/prestosql/types/UuidType.h>
Copy link
Contributor

@kagamiori kagamiori Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @BryanCutler
nit: Is this header not needed? If so, could you remove it? Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, good catch. No it's not needed anymore, I've removed it.

Summary:
This fixes the PrestoSerializer to put UUID values in the correct format that
is expected by Presto Java so that the values will match those from a Java
worker. First, when converting UUID to/from string, the values are no longer in
big endian format (as taken from boost::uuid) and are instead stored as a
little endian in an int128_t. Secondly, Presto Java will read UUID values from
an Int128ArrayBlock with the first value as the most significant bits. To
correct this, the upper/lower parts of the int128_t are swapped during
serialization/deserialization.

A unit test for checking roundtrip UUID serializaiton was added and manual
testing of Presto with a native worker to verify the problem from the issue
description is fixed.

From prestodb/presto#23311
@BryanCutler BryanCutler force-pushed the fix-uuid-java-serde-23311 branch from 7b7123b to bf000b0 Compare November 20, 2024 05:41
@facebook-github-bot
Copy link
Contributor

@kagamiori has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@kagamiori kagamiori changed the title Fix Presto Java UUID serialization fix: Fix Presto Java UUID serialization Nov 20, 2024
@facebook-github-bot
Copy link
Contributor

@kagamiori merged this pull request in fe4f5a7.

Copy link

Conbench analyzed the 1 benchmark run on commit fe4f5a77.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants