Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Result mismatches between Presto java and native for UUID types #23311

Open
aditi-pandit opened this issue Jul 26, 2024 · 11 comments
Open

Result mismatches between Presto java and native for UUID types #23311

aditi-pandit opened this issue Jul 26, 2024 · 11 comments
Assignees
Labels

Comments

@aditi-pandit
Copy link
Contributor

aditi-pandit commented Jul 26, 2024

Queries with UUID have different results in Java vs Native

Your Environment

  • Presto version used: 0.288

Expected Behavior

UUIDs are typically stored as VARCHAR or VARBINARY values in HMS/DWRF/Parquet etc. They are cast as UUID in subsequent SQL.

Presto Java uses Java UUID to represent UUID, whereas the block representation for native is int128. When getting results from presto-cli the order of bytes is flipped. This leads to misrepresentations.

Steps to Reproduce

presto:tpch_old> CREATE TABLE a AS SELECT c_uuid  FROM ( VALUES (null), ('33355449-2c7d-43d7-967a-f53cd23215ad'), ('eed9f812-4b0c-472f-8a10-4ae7bff79a47'), ('f768f36d-4f09-4da7-a298-3564d8f3c986')) AS x (c_uuid)

Presto Java 
select CAST(c_uuid as uuid) from a;
                _col0                 
--------------------------------------
 NULL                                 
 33355449-2c7d-43d7-967a-f53cd23215ad 
 eed9f812-4b0c-472f-8a10-4ae7bff79a47 
 f768f36d-4f09-4da7-a298-3564d8f3c986 
(4 rows)

Presto C++

select CAST(c_uuid as uuid) from a;
                _col0     
--------------------------------------
 NULL          
d7437d2c-4954-3533-ad15-32d23cf57a96
2f470c4b-12f8-d9ee-479a-f7bfe74a108a
a74d094f-6df3-68f7-86c9-f3d8643598a2


SELECT CAST(CAST(c_uuid AS uuid) AS VARCHAR) returns the same results however.
                _col0                 
--------------------------------------
 NULL                                 
 33355449-2c7d-43d7-967a-f53cd23215ad 
 eed9f812-4b0c-472f-8a10-4ae7bff79a47 
 f768f36d-4f09-4da7-a298-3564d8f3c986 
(4 rows)
@aditi-pandit
Copy link
Contributor Author

aditi-pandit commented Jul 26, 2024

@amitkdutta
Copy link
Contributor

Cc @kgpai @kagamiori

@aditi-pandit
Copy link
Contributor Author

The fix could be in the PrestoSerializer/Deserializer for UUID to use the byte-ordering needed by the Presto Java side. Prototyping a fix.

@tdcmeehan
Copy link
Contributor

I wonder if this is a correctness bug, since the encodings of UUIDs are a well-defined format: https://datatracker.ietf.org/doc/html/rfc9562. i.e. if anyone relies on comparing on persisted UUIDs, this might be incorrect since it seems the byte order is revered on either the Java or C++ side?

@BryanCutler
Copy link
Contributor

Hi @mohsaka , please let me know if there was anything I could help out with this issue, thanks.

@mohsaka
Copy link
Contributor

mohsaka commented Aug 6, 2024

Hi @BryanCutler
Feel free to take over the issue. I am not actively working on it.

@ethanyzhang ethanyzhang assigned BryanCutler and unassigned mohsaka Sep 11, 2024
@mohsaka
Copy link
Contributor

mohsaka commented Sep 20, 2024

So I think the underlying issue is actually with HUGEINT in presto that was never discovered since HUGEINT isn't a type in presto.

@BryanCutler Pointed out that its actually not reversed but rather reversed and swapped 64 bit words.

The main issue is that the HUGEINT type in presto is represented by an array of longs in the form of [MSW, LSW].
However in Velox, we store it in the form of Little Endian, [LSW,MSW].

Therefore when we send it over to presto, we are sending it over as an array of two longs or 64 bit words. Which means [LSW,MSW]. Then the serializer takes into account the endian reversal which results in us having [Reversed LSW, Reversed MSW].

So I recommended that he fix the HUGEINT issue first, then fix the reversal issue by putting the proper HUGEINT decimal value in so that when the endian is flipped we get the proper value in Java presto.

@mohsaka
Copy link
Contributor

mohsaka commented Sep 20, 2024

Thought about it some more. We would have to limit the Serializer changes to UUID only, as I presume Long Decimal is working correctly.

So either we can make the change in the serializer or we can make the change on the velox side when creating the int128_t.

@BryanCutler
Copy link
Contributor

Thanks @mohsaka , I did some more digging today and realized the root cause of the swapped bytes is in castFromString, here https://github.com/facebookincubator/velox/blob/8cd2d1ae694a9960ea4525fb838d8c02d9a91fa5/velox/functions/prestosql/types/UuidType.cpp#L128

The problem is that boost::uuid stores the data as a byte array with MSB first. Then a memcpy is done, so the result is an int128_t with upper and lower parts in BE format. For example, given this string as input

"33355449-2c7d-43d7-967a-f53cd23215ad"

boost::uuid will store as:

[0] = {uint8_t} 0x33 '3'
[1] = {uint8_t} 0x35 '5'
[2] = {uint8_t} 0x54 'T'
[3] = {uint8_t} 0x49 'I'
[4] = {uint8_t} 0x2c ','
[5] = {uint8_t} 0x7d '}'
[6] = {uint8_t} 0x43 'C'
[7] = {uint8_t} 0xd7 '\327'
[8] = {uint8_t} 0x96 '\226'
[9] = {uint8_t} 0x7a 'z'
[10] = {uint8_t} 0xf5 '\365'
[11] = {uint8_t} 0x3c '<'
[12] = {uint8_t} 0xd2 '\322'
[13] = {uint8_t} 0x32 '2'
[14] = {uint8_t} 0x15 '\025'
[15] = {uint8_t} 0xad '\255'

Then after the memcpy, we get incorrect values [0xd7437d2c49543533, 0xad1532d23cf57a96].

After changing the memcpy to assign the bytes in the correct order, they are also transferred to Java correctly - so there isn't any need for further byte swapping. However, you're correct in that the native worker serializes the HUGEINT as
[LSW,MSW], then when building a Java UUID it expects [MSW, LSW] in the `Int128ArrayBlock, so these word values will need to be swapped for UUID types in the serializer.

@mohsaka
Copy link
Contributor

mohsaka commented Sep 20, 2024

@BryanCutler Thank you for the update!

But I think the memcpy is correct?

Given the example
33355449-2c7d-43d7 -- 967a-f53cd23215ad
In the UUID type it would be separated into hex pairs. 33 35 ... etc.
Then we do a memcpy into memory. With 64 bit addresses then
0: [33, 35, 54, 49, 2c, 7d, 43, d7]
8: [96, 7a, f5, 3c, d2, 32, 15, ad]

I assume when you checked the values of the longs you printed the long values in hex?
Then since its little endian it would read right to left. This would match what you are seeing.

So I think the problem is still with both the reversal and the swapping of the array.

@ZacBlanco
Copy link
Contributor

ZacBlanco commented Oct 17, 2024

I believe there are some UUID comparisons in presto-java that are incorrect and should be fixed. I opened a PR here: #23847 . This likely affects the native implementation

BryanCutler added a commit to BryanCutler/velox that referenced this issue Nov 1, 2024
Summary:
This fixes the PrestoSerializer to put UUID values in the correct format that
is expected by Presto Java so that the values will match those from a Java
worker. First, when converting UUID to/from string, the values are no longer in
big endian format (as taken from boost::uuid) and are instead stored as a
little endian in an int128_t. Secondly, Presto Java will read UUID values from
an Int128ArrayBlock with the first value as the most significant bits. To
correct this, the upper/lower parts of the int128_t are swapped during
serialization/deserialization.

A unit test for checking roundtrip UUID serializaiton was added and manual
testing of Presto with a native worker to verify the problem from the issue
description is fixed.

From prestodb/presto#23311
BryanCutler added a commit to BryanCutler/velox that referenced this issue Nov 6, 2024
Summary:
This fixes the PrestoSerializer to put UUID values in the correct format that
is expected by Presto Java so that the values will match those from a Java
worker. First, when converting UUID to/from string, the values are no longer in
big endian format (as taken from boost::uuid) and are instead stored as a
little endian in an int128_t. Secondly, Presto Java will read UUID values from
an Int128ArrayBlock with the first value as the most significant bits. To
correct this, the upper/lower parts of the int128_t are swapped during
serialization/deserialization.

A unit test for checking roundtrip UUID serializaiton was added and manual
testing of Presto with a native worker to verify the problem from the issue
description is fixed.

From prestodb/presto#23311
BryanCutler added a commit to BryanCutler/velox that referenced this issue Nov 7, 2024
Summary:
This fixes the PrestoSerializer to put UUID values in the correct format that
is expected by Presto Java so that the values will match those from a Java
worker. First, when converting UUID to/from string, the values are no longer in
big endian format (as taken from boost::uuid) and are instead stored as a
little endian in an int128_t. Secondly, Presto Java will read UUID values from
an Int128ArrayBlock with the first value as the most significant bits. To
correct this, the upper/lower parts of the int128_t are swapped during
serialization/deserialization.

A unit test for checking roundtrip UUID serializaiton was added and manual
testing of Presto with a native worker to verify the problem from the issue
description is fixed.

From prestodb/presto#23311
BryanCutler added a commit to BryanCutler/velox that referenced this issue Nov 13, 2024
Summary:
This fixes the PrestoSerializer to put UUID values in the correct format that
is expected by Presto Java so that the values will match those from a Java
worker. First, when converting UUID to/from string, the values are no longer in
big endian format (as taken from boost::uuid) and are instead stored as a
little endian in an int128_t. Secondly, Presto Java will read UUID values from
an Int128ArrayBlock with the first value as the most significant bits. To
correct this, the upper/lower parts of the int128_t are swapped during
serialization/deserialization.

A unit test for checking roundtrip UUID serializaiton was added and manual
testing of Presto with a native worker to verify the problem from the issue
description is fixed.

From prestodb/presto#23311
BryanCutler added a commit to BryanCutler/velox that referenced this issue Nov 13, 2024
Summary:
This fixes the PrestoSerializer to put UUID values in the correct format that
is expected by Presto Java so that the values will match those from a Java
worker. First, when converting UUID to/from string, the values are no longer in
big endian format (as taken from boost::uuid) and are instead stored as a
little endian in an int128_t. Secondly, Presto Java will read UUID values from
an Int128ArrayBlock with the first value as the most significant bits. To
correct this, the upper/lower parts of the int128_t are swapped during
serialization/deserialization.

A unit test for checking roundtrip UUID serializaiton was added and manual
testing of Presto with a native worker to verify the problem from the issue
description is fixed.

From prestodb/presto#23311
BryanCutler added a commit to BryanCutler/velox that referenced this issue Nov 13, 2024
Summary:
This fixes the PrestoSerializer to put UUID values in the correct format that
is expected by Presto Java so that the values will match those from a Java
worker. First, when converting UUID to/from string, the values are no longer in
big endian format (as taken from boost::uuid) and are instead stored as a
little endian in an int128_t. Secondly, Presto Java will read UUID values from
an Int128ArrayBlock with the first value as the most significant bits. To
correct this, the upper/lower parts of the int128_t are swapped during
serialization/deserialization.

A unit test for checking roundtrip UUID serializaiton was added and manual
testing of Presto with a native worker to verify the problem from the issue
description is fixed.

From prestodb/presto#23311
BryanCutler added a commit to BryanCutler/velox that referenced this issue Nov 14, 2024
Summary:
This fixes the PrestoSerializer to put UUID values in the correct format that
is expected by Presto Java so that the values will match those from a Java
worker. First, when converting UUID to/from string, the values are no longer in
big endian format (as taken from boost::uuid) and are instead stored as a
little endian in an int128_t. Secondly, Presto Java will read UUID values from
an Int128ArrayBlock with the first value as the most significant bits. To
correct this, the upper/lower parts of the int128_t are swapped during
serialization/deserialization.

A unit test for checking roundtrip UUID serializaiton was added and manual
testing of Presto with a native worker to verify the problem from the issue
description is fixed.

From prestodb/presto#23311
BryanCutler added a commit to BryanCutler/velox that referenced this issue Nov 14, 2024
Summary:
This fixes the PrestoSerializer to put UUID values in the correct format that
is expected by Presto Java so that the values will match those from a Java
worker. First, when converting UUID to/from string, the values are no longer in
big endian format (as taken from boost::uuid) and are instead stored as a
little endian in an int128_t. Secondly, Presto Java will read UUID values from
an Int128ArrayBlock with the first value as the most significant bits. To
correct this, the upper/lower parts of the int128_t are swapped during
serialization/deserialization.

A unit test for checking roundtrip UUID serializaiton was added and manual
testing of Presto with a native worker to verify the problem from the issue
description is fixed.

From prestodb/presto#23311
BryanCutler added a commit to BryanCutler/velox that referenced this issue Nov 19, 2024
Summary:
This fixes the PrestoSerializer to put UUID values in the correct format that
is expected by Presto Java so that the values will match those from a Java
worker. First, when converting UUID to/from string, the values are no longer in
big endian format (as taken from boost::uuid) and are instead stored as a
little endian in an int128_t. Secondly, Presto Java will read UUID values from
an Int128ArrayBlock with the first value as the most significant bits. To
correct this, the upper/lower parts of the int128_t are swapped during
serialization/deserialization.

A unit test for checking roundtrip UUID serializaiton was added and manual
testing of Presto with a native worker to verify the problem from the issue
description is fixed.

From prestodb/presto#23311
BryanCutler added a commit to BryanCutler/velox that referenced this issue Nov 19, 2024
Summary:
This fixes the PrestoSerializer to put UUID values in the correct format that
is expected by Presto Java so that the values will match those from a Java
worker. First, when converting UUID to/from string, the values are no longer in
big endian format (as taken from boost::uuid) and are instead stored as a
little endian in an int128_t. Secondly, Presto Java will read UUID values from
an Int128ArrayBlock with the first value as the most significant bits. To
correct this, the upper/lower parts of the int128_t are swapped during
serialization/deserialization.

A unit test for checking roundtrip UUID serializaiton was added and manual
testing of Presto with a native worker to verify the problem from the issue
description is fixed.

From prestodb/presto#23311
BryanCutler added a commit to BryanCutler/velox that referenced this issue Nov 19, 2024
Summary:
This fixes the PrestoSerializer to put UUID values in the correct format that
is expected by Presto Java so that the values will match those from a Java
worker. First, when converting UUID to/from string, the values are no longer in
big endian format (as taken from boost::uuid) and are instead stored as a
little endian in an int128_t. Secondly, Presto Java will read UUID values from
an Int128ArrayBlock with the first value as the most significant bits. To
correct this, the upper/lower parts of the int128_t are swapped during
serialization/deserialization.

A unit test for checking roundtrip UUID serializaiton was added and manual
testing of Presto with a native worker to verify the problem from the issue
description is fixed.

From prestodb/presto#23311
BryanCutler added a commit to BryanCutler/velox that referenced this issue Nov 20, 2024
Summary:
This fixes the PrestoSerializer to put UUID values in the correct format that
is expected by Presto Java so that the values will match those from a Java
worker. First, when converting UUID to/from string, the values are no longer in
big endian format (as taken from boost::uuid) and are instead stored as a
little endian in an int128_t. Secondly, Presto Java will read UUID values from
an Int128ArrayBlock with the first value as the most significant bits. To
correct this, the upper/lower parts of the int128_t are swapped during
serialization/deserialization.

A unit test for checking roundtrip UUID serializaiton was added and manual
testing of Presto with a native worker to verify the problem from the issue
description is fixed.

From prestodb/presto#23311
facebook-github-bot pushed a commit to facebookincubator/velox that referenced this issue Nov 21, 2024
Summary:
This fixes the `PrestoSerializer` to put UUID values in the correct format that is expected by Presto Java so that the values will match those from a Java worker. First, when converting UUID to/from string, the values are no longer in big endian format (as taken from boost::uuid) and are instead stored as a little endian in an int128_t. Secondly, Presto Java will read UUID values from an `Int128ArrayBlock` with the first value as the most significant bits. To correct this, the upper/lower parts of the int128_t are swapped during serialization/deserialization.

A unit test for checking roundtrip UUID serializaiton was added and manual testing of Presto with a native worker to verify the problem from the issue description is fixed.

From prestodb/presto#23311

Pull Request resolved: #11197

Reviewed By: Yuhta

Differential Revision: D66204014

Pulled By: kagamiori

fbshipit-source-id: 1e2842df57e672d6cf056a658108691e3ad39d15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: 🆕 Unprioritized
Development

No branches or pull requests

6 participants