Skip to content

HIVE-29040: Add UUIDV7 built in function #5891

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

BsoBird
Copy link
Contributor

@BsoBird BsoBird commented Jun 24, 2025

What changes were proposed in this pull request?

support uuid_v7 function
https://issues.apache.org/jira/browse/HIVE-29040

Why are the changes needed?

Does this PR introduce any user-facing change?

No.

How was this patch tested?

See TestUDFUUID.class

@BsoBird BsoBird changed the title support uuid_v7 function HIVE-29040: Add UUIDV7 built in function Jun 24, 2025
@BsoBird
Copy link
Contributor Author

BsoBird commented Jun 24, 2025

@zhangbutao hello,Sir.
Can you help check this?tks.

}


private UUID randomUUIDV7() {
Copy link
Member

@deniskuzZ deniskuzZ Jun 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deniskuzZ https://github.com/robsonkades/uuidv7
This is a fast method for calculating UUIDV7.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/f4b6a3/uuid-creator seems to have higher rating score

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deniskuzZ Okay,let me see.

Copy link
Contributor Author

@BsoBird BsoBird Jun 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deniskuzZ I reviewed the code and conducted some simple tests. Currently, I have selected a UUIDv7 generation algorithm with relatively the best performance (approximately generating 18 million UUIDs per second in a single thread). Of course, if we need to switch to a different implementation library, I have no objections. What do you think? After all, the UUIDv7 algorithm isn't very complex, so perhaps there's no need to introduce a complicated library?After all, the performance of using SecureRandom + ByteBuffer to generate UUIDV7 is quite poor.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UUID users expect every generated ID to be unique. The random generator needs to be as secure as we can guarantee the uniqueness in my opinion.
As for the dependency, using mvn would make it easier to maintain licenses.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It really depends if we are talking about adding a new dependency to a huge library or if we are adding a dependency with a single Java class. Copying code is not a good practice and has various disadvantages.

If for instance you copy the code in Hive then we cannot take advantage of improvements/bug fixes/security patches that will land in the dependent library. Nobody, will care to bring these changes to Hive and it may be too late when we realize that somethings is needed.

Moreover, when you copy code from open-source projects and you have to follow the ASF policies (i.e., https://www.apache.org/legal/src-headers.html#3party) that I don't think you did in this case. This has legal implications that make our life more difficult.

Dependencies come and go so its fine to add/remove things when it makes sense to do so.

Copy link
Contributor Author

@BsoBird BsoBird Jun 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zabetak @deniskuzZ @okumin
I replaced the UUIDv7 generation code with the uuid-creator dependency introduced via Maven.
Currently, I am still using the high-performance code to generate UUIDv7: Guid.v7().toString().
If we need to enforce the use of cryptographically secure UUIDv7, please let me know.

-- Meters ----------------------------------------------------------------------
uuid-creator:use-threadlocal-random
count = 165457936
mean rate = 18360701.04 events/second
1-minute rate = 17878586.40 events/second
5-minute rate = 17878586.40 events/second
15-minute rate = 17878586.40 events/second

Copy link
Contributor Author

@BsoBird BsoBird Jun 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UUID users expect every generated ID to be unique. The random generator needs to be as secure as we can guarantee the uniqueness in my opinion. As for the dependency, using mvn would make it easier to maintain licenses.

@okumin If the generation speed of UUIDV7 using security-random could be a bit faster (for example, 100,000 pieces of data per second), I wouldn't mind using it. The biggest problem now is that its generation speed is too slow.

In the vast majority of database/data warehouse cases, uuid_v7 is only used as a unique key for records. In this scenario, even if the UUID is predicted, it doesn't seem to have much impact.

Assuming we generate 20 billion uuid_v7 ids, with 300 running in parallel, it would take a full 3.7 hours to complete. However, using the original uuid (uuidv4) function only takes 0.009 hours. This gap is simply too large. I can hardly imagine how the inefficient UUID_V7 could be effectively used in a production environment.

Copy link
Member

@deniskuzZ deniskuzZ Jun 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uuid-creator dependency should be good and project has also multiple stars

> SELECT _FUNC_();
'36718a53-84f5-45d6-8796-4f79983ad49d'""")
@UDFType(deterministic = false)
public class UDFUUIDv7 extends UDF {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to use GenericUDF for new one because UDF is deprecated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

}


private UUID randomUUIDV7() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UUID users expect every generated ID to be unique. The random generator needs to be as secure as we can guarantee the uniqueness in my opinion.
As for the dependency, using mvn would make it easier to maintain licenses.

Copy link
Member

@deniskuzZ deniskuzZ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

public void testUUIDv7() throws Exception {
UDFUUIDv7 udf = new UDFUUIDv7();
String id1 = udf.evaluate(null).toString();
TimeUnit.MILLISECONDS.sleep(1);
Copy link
Member

@deniskuzZ deniskuzZ Jun 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deniskuzZ Because UUIDV7 is incrementing over time, I want to prove that the UUID generated after 1 millisecond must have a larger lexicographical order than the previous UUID.

Copy link
Contributor Author

@BsoBird BsoBird Jun 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deniskuzZ There is no such thing as asynchronously waiting for the previous task to complete before executing the next task here. I just want to delay the time by 1 millisecond before generating the second UUIDV7 and ensure that the latter UUID is definitely greater than the former one. If this condition is not met, then the UUID we generated must not comply with the UUIDV7 specification.

String v7_uuid_id1 = udf.evaluate(null).toString();
TimeUnit.MILLISECONDS.sleep(1);
String v7_uuid_id2 = udf.evaluate(null).toString();
assertTrue(v7_uuid_id2.compareTo(v7_uuid_id1)>0);  //id2 must be greater than ID1

If we don't verify the time-sequential characteristic of UUIDV7 in unit tests, that seems too strange.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not even sure we need such kind of tests in this repo. This seems to be testing the correctness of the underlying implementation in com.github.f4b6a3 so such tests should be contributed there if they don't exist already.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on the above point

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the test code for verifying UUIDV7 increment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deniskuzZ @zabetak
Sir, I have replaced the test cases and removed Thread.sleep(). However, I still personally believe that some basic test cases should be added. Not adding any tests feels very strange. Based on consistent experience, we can assume that code without test cases simply does not work properly. Therefore, not adding test cases makes me feel very uncomfortable. (Maybe I have OCD.)😝

Copy link
Member

@deniskuzZ deniskuzZ Jun 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BsoBird, I think we need a functional test that validates the actual UDF invocation end-to-end.
I've only found 1 q tests (ppd2.q) that uses the uuid().
Can we extend one of the existing udf*.q tests or create a new one?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An end-to-end test in a .q file is the way to go.

Given that we depend on an external library I would also suggest to run a smoke test by starting a Hive docker container (with these changes), set tez.local.mode=false, and execute a simple query making use of the new function. It's quite common to see ClassNotFoundException when new libraries are added in the ql module.

Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants