-
Notifications
You must be signed in to change notification settings - Fork 4.8k
HIVE-29040: Add UUIDV7 built in function #5891
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
@zhangbutao hello,Sir. |
} | ||
|
||
|
||
private UUID randomUUIDV7() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BsoBird, where did you get the UUIDv7 impl?
- https://bugs.openjdk.org/browse/JDK-8357251, 8334015: Add Support for UUID Version 7 (UUIDv7) defined in RFC 9562 openjdk/jdk#25303
- https://antonz.org/uuidv7/#java
- import https://github.com/f4b6a3/uuid-creator mvn dependency
UUID uuid = UuidCreator.getTimeOrderedEpoch();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@deniskuzZ https://github.com/robsonkades/uuidv7
This is a fast method for calculating UUIDV7.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/f4b6a3/uuid-creator seems to have higher rating score
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@deniskuzZ Okay,let me see.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@deniskuzZ I reviewed the code and conducted some simple tests. Currently, I have selected a UUIDv7 generation algorithm with relatively the best performance (approximately generating 18 million UUIDs per second in a single thread). Of course, if we need to switch to a different implementation library, I have no objections. What do you think? After all, the UUIDv7 algorithm isn't very complex, so perhaps there's no need to introduce a complicated library?After all, the performance of using SecureRandom + ByteBuffer to generate UUIDV7 is quite poor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UUID users expect every generated ID to be unique. The random generator needs to be as secure as we can guarantee the uniqueness in my opinion.
As for the dependency, using mvn would make it easier to maintain licenses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It really depends if we are talking about adding a new dependency to a huge library or if we are adding a dependency with a single Java class. Copying code is not a good practice and has various disadvantages.
If for instance you copy the code in Hive then we cannot take advantage of improvements/bug fixes/security patches that will land in the dependent library. Nobody, will care to bring these changes to Hive and it may be too late when we realize that somethings is needed.
Moreover, when you copy code from open-source projects and you have to follow the ASF policies (i.e., https://www.apache.org/legal/src-headers.html#3party) that I don't think you did in this case. This has legal implications that make our life more difficult.
Dependencies come and go so its fine to add/remove things when it makes sense to do so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zabetak @deniskuzZ @okumin
I replaced the UUIDv7 generation code with the uuid-creator dependency introduced via Maven.
Currently, I am still using the high-performance code to generate UUIDv7: Guid.v7().toString().
If we need to enforce the use of cryptographically secure UUIDv7, please let me know.
-- Meters ----------------------------------------------------------------------
uuid-creator:use-threadlocal-random
count = 165457936
mean rate = 18360701.04 events/second
1-minute rate = 17878586.40 events/second
5-minute rate = 17878586.40 events/second
15-minute rate = 17878586.40 events/second
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UUID users expect every generated ID to be unique. The random generator needs to be as secure as we can guarantee the uniqueness in my opinion. As for the dependency, using mvn would make it easier to maintain licenses.
@okumin If the generation speed of UUIDV7 using security-random could be a bit faster (for example, 100,000 pieces of data per second), I wouldn't mind using it. The biggest problem now is that its generation speed is too slow.
In the vast majority of database/data warehouse cases, uuid_v7 is only used as a unique key for records. In this scenario, even if the UUID is predicted, it doesn't seem to have much impact.
Assuming we generate 20 billion uuid_v7 ids, with 300 running in parallel, it would take a full 3.7 hours to complete. However, using the original uuid (uuidv4) function only takes 0.009 hours. This gap is simply too large. I can hardly imagine how the inefficient UUID_V7 could be effectively used in a production environment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uuid-creator
dependency should be good and project has also multiple stars
> SELECT _FUNC_(); | ||
'36718a53-84f5-45d6-8796-4f79983ad49d'""") | ||
@UDFType(deterministic = false) | ||
public class UDFUUIDv7 extends UDF { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to use GenericUDF for new one because UDF is deprecated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
} | ||
|
||
|
||
private UUID randomUUIDV7() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UUID users expect every generated ID to be unique. The random generator needs to be as secure as we can guarantee the uniqueness in my opinion.
As for the dependency, using mvn would make it easier to maintain licenses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
public void testUUIDv7() throws Exception { | ||
UDFUUIDv7 udf = new UDFUUIDv7(); | ||
String id1 = udf.evaluate(null).toString(); | ||
TimeUnit.MILLISECONDS.sleep(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@deniskuzZ Because UUIDV7 is incrementing over time, I want to prove that the UUID generated after 1 millisecond must have a larger lexicographical order than the previous UUID.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@deniskuzZ There is no such thing as asynchronously waiting for the previous task to complete before executing the next task here. I just want to delay the time by 1 millisecond before generating the second UUIDV7 and ensure that the latter UUID is definitely greater than the former one. If this condition is not met, then the UUID we generated must not comply with the UUIDV7 specification.
String v7_uuid_id1 = udf.evaluate(null).toString();
TimeUnit.MILLISECONDS.sleep(1);
String v7_uuid_id2 = udf.evaluate(null).toString();
assertTrue(v7_uuid_id2.compareTo(v7_uuid_id1)>0); //id2 must be greater than ID1
If we don't verify the time-sequential characteristic of UUIDV7 in unit tests, that seems too strange.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not even sure we need such kind of tests in this repo. This seems to be testing the correctness of the underlying implementation in com.github.f4b6a3
so such tests should be contributed there if they don't exist already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on the above point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the test code for verifying UUIDV7 increment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@deniskuzZ @zabetak
Sir, I have replaced the test cases and removed Thread.sleep(). However, I still personally believe that some basic test cases should be added. Not adding any tests feels very strange. Based on consistent experience, we can assume that code without test cases simply does not work properly. Therefore, not adding test cases makes me feel very uncomfortable. (Maybe I have OCD.)😝
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BsoBird, I think we need a functional test that validates the actual UDF invocation end-to-end.
I've only found 1 q tests (ppd2.q) that uses the uuid()
.
Can we extend one of the existing udf*.q
tests or create a new one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An end-to-end test in a .q file is the way to go.
Given that we depend on an external library I would also suggest to run a smoke test by starting a Hive docker container (with these changes), set tez.local.mode=false
, and execute a simple query making use of the new function. It's quite common to see ClassNotFoundException
when new libraries are added in the ql module.
|
What changes were proposed in this pull request?
support uuid_v7 function
https://issues.apache.org/jira/browse/HIVE-29040
Why are the changes needed?
Does this PR introduce any user-facing change?
No.
How was this patch tested?
See TestUDFUUID.class