-
Notifications
You must be signed in to change notification settings - Fork 387
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix for ProtobufStructObjectInspector to work correctly with Hive #400
base: master
Are you sure you want to change the base?
Conversation
@@ -20,167 +20,201 @@ | |||
|
|||
public final class ProtobufStructObjectInspector extends SettableStructObjectInspector { | |||
|
|||
public static class ProtobufStructField implements StructField { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason why you completely changed the formatting? I'm guessing it was your IDE?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. It was. Fixed
Hi Rahul, I am getting an exception with your fix. I have compiled elephant-bird for hadoop2 and bumped protobuf to 2.5.0. I will try to get more debug info next week. Diagnostic Messages for this Task: |
I think this is likely because of a repeated field in your schema. We don't have any, so, I did not do much testing around that. There were existing unit tests which dealt with repeated fields, and I essentially got them to pass. I can take a look at it later this week |
I think I have a fix. Let me know if you are ready to try it |
Rahul, thanks a lot. I'll test it first thing tomorrow morning. |
Mike, please try the code in the branch : https://github.com/rahulrv1980/elephant-bird/tree/RepeatedFieldFix. Also, if it does not work, could you write a test case for the failure and I can look to fix it |
Rahul, great - the previous exception is gone, but I am seeing another one: Caused by: java.lang.ClassCastException: Logentries$Impression$NestedMessageType cannot be cast to com.google.protobuf.Message$Builder NestedMessageType is the type of first non-empty repeated nested message in the Impression message. Let me know if I can help fixing this. |
Could you provide a test case with your message definition with the failures. There are existing unit tests for ProtobufStructObjectInspector. You could modify an existing one or better still, add a new one mimicking your use case with protocol buffers and getting a repro. |
Hi Rahul, I have created a patch that does not cause exceptions: I will need to verify that the data is correct. I'll try to add unit tests after that. There seem to be a problem though - I am seeing a huge performance degradation. We have patched Hive to assume that the table Serde and partition Serdes are the same. This way EB works for us on the partitioned tables. If I use your branch and the unpatched Hive, it seems, that Hive tries to convert each row to the table Serde and that causes huge performance loss - I wasn't even able to run select count on 30 sec of our data. I'll post more details later. |
How are you validating that the change caused the performance degradation? Did the ProtobufStructObjectInspector without my change work at higher performance (Not sure how the old version worked at all). Could you try this on your patched Hive? I am making my first venture into Hive and so, I don't have a baseline to compare performance |
This is not a bug in your code. You code enables a use case that didn't work before at all with the stock Hive. Specifically, it allows to run queries on Hive tables with partitions. I am still digging into this but here is what I think is going on. The gist of the performance problem is that Hive looks at each partition Serde separately and tries to find a way to convert partition rows to what it perceives to be the overall table Serde (AFAIK, table serde can be specified explicitly, or, if not, is taken from the first(?) partition). Under debugger, I see that hive-0.12.0-cdh5.0.1/src/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorConverters.java:147 if (inputOI.equals(outputOI)) { check is false, even though both inputOI and outputOI are of the same com.twitter.elephantbird.hive.serde.ProtobufStructObjectInspector type and represent exactly the same schema. This is because ProtobufStructObjectInspector does not override equals/hashCode methods. Because of the failed check, Hive sees that inputOI and outputOI are different and creates a converter from the former to the latter using your code. It then applies it to every row in the table, which turns out to be a very expensive operation both in execution time and memory. My change simply monkey patches Hive to always think that inputIO and outputIO are the same object, and with it I have a reasonable performance: MapOperator.java : 200 if (opCtx.tblRawRowObjectInspector.getClass().getName() != "com.twitter.elephantbird.hive.serde.ProtobufStructObjectInspector") { same in FetchOperator.java: 400 if (currTbl != null || serde.getClass().getName() == "com.twitter.elephantbird.hive.serde.ProtobufDeserializer") { So, to complete your patch we probably need to override equals on the Protobuf inspectors. Let me know if you would have time for this (in case, of course, my assessment is correct). If not, I can do it myself but it will take me longer, since I don't understand the code that well yet. I will also add more unit tests by the end of the week. |
Got it. Could you try adding this equals() method? @OverRide |
I have added your fix + the equals() method change to the branch at https://github.com/rahulrv1980/elephant-bird/tree/RepeatedFieldFix. Let me know if it works |
Thanks, Rahul. I will try tomorrow morning. |
I modified the code slightly: public boolean equals(Object o) { @OverRide and I see that descriptor.equals(that.descriptor) is true, but builder.equals(that.builder) is false. Do you think that the latter check is essential?. |
I suppose the descriptor alone may be sufficient. But, it is strange that builder.equals(that.builder) is returning false. Could you share what the builder objects are? |
Sorry, was busy with getting Spark/Shark working with EB. I will send an update tomorrow. |
I check the descriptor - in all calls to the equal method the compared objects are actually the same object (and different for the Builder objects). Looks like Descriptor class does not override equals method, so it works only because it's the same object. I tried adding another partition and I see an exception. Not sure if it's related or not. Ironically, Shark works ok across multiple partitions. |
Could you dump the different builder objects? |
@mikebern ping? |
You could try adding below to the proto file: |
I was hitting this issue, so I cloned and built your changes (https://github.com/rahulrv1980/elephant-bird). With a simple table of AddressBooks (com.twitter.elephantbird.examples.proto.AddressBookProtos$AddressBook), now I'm hitting: Diagnostic Messages for this Task: Thoughts? Thanks in advance! |
Rahul Ravindran seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it. |
Use builder class since hive expects mutable objects while message objects are immutable