
Fix for ProtobufStructObjectInspector to work correctly with Hive #400

Status: Open · wants to merge 4 commits into base: master
Conversation

@rahulrv1980 (Author)

Use the builder class, since Hive expects mutable objects while protobuf message objects are immutable.
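The motivation above can be sketched without protobuf at all. Below is a minimal, self-contained Java illustration of the immutable-message/mutable-builder pattern that protobuf generates (the `Person` class is hypothetical, not part of elephant-bird): Hive's `SettableStructObjectInspector` needs an object whose fields it can set in place, which only the builder provides.

```java
// Sketch of the immutable-message / mutable-builder pattern protobuf generates.
// A settable object inspector can only mutate the Builder, never the finished
// message, which is why the fix routes Hive's writes through a Builder.
public final class Person {
  private final String name;   // immutable once built

  private Person(Builder b) { this.name = b.name; }

  public String getName() { return name; }

  public static final class Builder {
    private String name;

    public Builder setName(String name) {  // mutable: fields can be updated in place
      this.name = name;
      return this;
    }

    public Person build() { return new Person(this); }
  }

  public static void main(String[] args) {
    Builder builder = new Builder();
    builder.setName("Ada");            // Hive-style in-place mutation works here
    Person p = builder.build();        // the finished message is read-only
    System.out.println(p.getName());   // prints "Ada"
  }
}
```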

@@ -20,167 +20,201 @@

public final class ProtobufStructObjectInspector extends SettableStructObjectInspector {

public static class ProtobufStructField implements StructField {
Contributor:

Is there a reason why you completely changed the formatting? I'm guessing it was your IDE?

Author:

Yes, it was. Fixed.

@mikebern commented Jun 6, 2014

Hi Rahul,

I am getting an exception with your fix. I have compiled elephant-bird for hadoop2 and bumped protobuf to 2.5.0. I will try to get more debug info next week.

Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:175)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:513)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:157)
... 8 more
Caused by: java.lang.ClassCastException: java.util.ArrayList cannot be cast to com.google.protobuf.Message$Builder
at com.twitter.elephantbird.hive.serde.ProtobufStructObjectInspector.setStructFieldData(ProtobufStructObjectInspector.java:159)
at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters$StructConverter.convert(ObjectInspectorConverters.java:396)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.readRow(MapOperator.java:152)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.access$300(MapOperator.java:125)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:488)
... 9 more

@rahulrv1980 (Author)

I think this is likely because of a repeated field in your schema. We don't have any, so I did not do much testing around that. There were existing unit tests dealing with repeated fields, and I essentially got them to pass. I can take a look at it later this week.

@rahulrv1980 (Author)

I think I have a fix. Let me know if you are ready to try it

@mikebern commented Jun 9, 2014

Rahul, thanks a lot. I'll test it first thing tomorrow morning.

@rahulrv1980 (Author)

Mike, please try the code in this branch: https://github.com/rahulrv1980/elephant-bird/tree/RepeatedFieldFix. If it does not work, could you write a test case for the failure so I can look into fixing it?

@mikebern commented Jun 9, 2014

Rahul, great - the previous exception is gone, but I am seeing another one:

Caused by: java.lang.ClassCastException: Logentries$Impression$NestedMessageType cannot be cast to com.google.protobuf.Message$Builder
at com.twitter.elephantbird.hive.serde.ProtobufStructObjectInspector.getStructFieldData(ProtobufStructObjectInspector.java:193)
at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters$StructConverter.convert(ObjectInspectorConverters.java:394)
at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters$ListConverter.convert(ObjectInspectorConverters.java:337)
at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters$StructConverter.convert(ObjectInspectorConverters.java:395)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.readRow(MapOperator.java:152)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.access$300(MapOperator.java:125)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:488)

NestedMessageType is the type of the first non-empty repeated nested message in the Impression message.

Let me know if I can help fix this.

@rahulrv1980 (Author)

Could you provide a test case with your message definition that reproduces the failures? There are existing unit tests for ProtobufStructObjectInspector; you could modify an existing one or, better still, add a new one mimicking your use case with protocol buffers to get a repro.

@mikebern

Hi Rahul, I have created a patch that does not cause exceptions:
https://gist.github.com/mikebern/825812f0de95a2f03dd5

I will need to verify that the data is correct. I'll try to add unit tests after that.

There seems to be a problem, though: I am seeing a huge performance degradation. We have patched Hive to assume that the table Serde and partition Serdes are the same; this way EB works for us on partitioned tables. If I use your branch with unpatched Hive, it seems that Hive tries to convert each row to the table Serde, which causes a huge performance loss - I wasn't even able to run a select count on 30 seconds of our data. I'll post more details later.

@rahulrv1980 (Author)

How are you validating that the change caused the performance degradation? Did ProtobufStructObjectInspector without my change perform better? (I'm not sure how the old version worked at all.) Could you try this on your patched Hive? This is my first venture into Hive, so I don't have a performance baseline to compare against.

@mikebern

This is not a bug in your code. Your code enables a use case that didn't work at all before with stock Hive: specifically, it allows running queries on Hive tables with partitions. I am still digging into this, but here is what I think is going on.

The gist of the performance problem is that Hive looks at each partition Serde separately and tries to find a way to convert partition rows to what it perceives to be the overall table Serde (AFAIK, table serde can be specified explicitly, or, if not, is taken from the first(?) partition).

Under the debugger, I see that this check at hive-0.12.0-cdh5.0.1/src/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorConverters.java:147

if (inputOI.equals(outputOI)) {
  return new IdentityConverter();
}

evaluates to false, even though both inputOI and outputOI are of the same com.twitter.elephantbird.hive.serde.ProtobufStructObjectInspector type and represent exactly the same schema. This is because ProtobufStructObjectInspector does not override the equals/hashCode methods.

Because of the failed check, Hive sees that inputOI and outputOI are different and creates a converter from the former to the latter using your code. It then applies it to every row in the table, which turns out to be a very expensive operation both in execution time and memory.
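The failed equality check described above can be reproduced in isolation. Below is a minimal Java sketch (the `Inspector` class is a hypothetical stand-in, not Hive's ObjectInspector) showing that without an equals() override, two instances describing the same schema compare by reference, so the IdentityConverter shortcut is never taken:

```java
// Demonstrates why Hive's inputOI.equals(outputOI) shortcut fails when the
// inspector class inherits Object.equals (reference equality).
public class EqualsDemo {
  // Stand-in for an ObjectInspector that does NOT override equals/hashCode.
  static final class Inspector {
    final String schema;
    Inspector(String schema) { this.schema = schema; }
  }

  public static void main(String[] args) {
    Inspector inputOI = new Inspector("struct<a:int,b:string>");
    Inspector outputOI = new Inspector("struct<a:int,b:string>");
    // Same schema, different instances: Object.equals falls back to ==,
    // so Hive would skip IdentityConverter and build a full, expensive converter.
    System.out.println(inputOI.equals(outputOI)); // prints "false"
    System.out.println(inputOI.equals(inputOI));  // prints "true"
  }
}
```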

My change simply monkey-patches Hive to always think that inputOI and outputOI are the same object, and with it I get reasonable performance:

MapOperator.java : 200

if (!"com.twitter.elephantbird.hive.serde.ProtobufStructObjectInspector"
    .equals(opCtx.tblRawRowObjectInspector.getClass().getName())) {
  opCtx.partTblObjectInspectorConverter = ObjectInspectorConverters.getConverter(
      partRawRowObjectInspector, opCtx.tblRawRowObjectInspector);
} else {
  opCtx.partTblObjectInspectorConverter = ObjectInspectorConverters.getConverter(
      opCtx.tblRawRowObjectInspector, opCtx.tblRawRowObjectInspector);
}

Same in FetchOperator.java:400:

if (currTbl != null
    || "com.twitter.elephantbird.hive.serde.ProtobufDeserializer".equals(serde.getClass().getName())) {
  tblSerde = serde;
} else {
  tblSerde = currPart.getTableDesc().getDeserializerClass().newInstance();
  tblSerde.initialize(job, currPart.getTableDesc().getProperties());
}

So, to complete your patch, we probably need to override equals() on the protobuf inspectors. Let me know if you have time for this (in case, of course, my assessment is correct). If not, I can do it myself, but it will take me longer since I don't understand the code that well yet.

I will also add more unit tests by the end of the week.

@rahulrv1980 (Author)

Got it. Could you try adding this equals() method?

@Override
public boolean equals(Object o) {
  if (!(o instanceof ProtobufStructObjectInspector)) {
    return false;
  }
  ProtobufStructObjectInspector that = (ProtobufStructObjectInspector) o;
  return descriptor.equals(that.descriptor) && builder.equals(that.builder);
}

@rahulrv1980 (Author)

I have added your fix + the equals() method change to the branch at https://github.com/rahulrv1980/elephant-bird/tree/RepeatedFieldFix. Let me know if it works

@mikebern

Thanks, Rahul. I will try tomorrow morning.

@mikebern

I modified the code slightly:

@Override
public boolean equals(Object o) {
  if (!(o instanceof ProtobufStructObjectInspector)) {
    return false;
  }
  ProtobufStructObjectInspector that = (ProtobufStructObjectInspector) o;
  boolean res = descriptor.equals(that.descriptor);
  boolean res2 = builder.equals(that.builder);
  return res && res2;
}

@Override
public int hashCode() {
  return descriptor.hashCode() + 7 * builder.hashCode();
}

and I see that descriptor.equals(that.descriptor) is true, but builder.equals(that.builder) is false. Do you think the latter check is essential?

@rahulrv1980 (Author)

I suppose the descriptor alone may be sufficient. But it is strange that builder.equals(that.builder) returns false. Could you share what the builder objects are?

@mikebern

Sorry, was busy with getting Spark/Shark working with EB. I will send an update tomorrow.

@mikebern

I checked the descriptor - in all calls to the equals method, the compared objects are actually the same object (they differ for the Builder objects). It looks like the Descriptor class does not override equals, so the check passes only because it's the same object. I tried adding another partition and I see an exception; not sure if it's related or not.

Ironically, Shark works ok across multiple partitions.

@rahulrv1980 (Author)

Could you dump the different builder objects?

@rahulrv1980 (Author)

@mikebern ping?

@rahulrv1980 (Author)

You could try adding the following to the proto file:
option java_generate_equals_and_hash = true;
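For context, the option goes at file scope in the .proto. Under protobuf 2.5.x it makes protoc generate value-based equals()/hashCode() for the messages (rather than inheriting reference equality from Object). A minimal sketch, with a hypothetical Example message:

```proto
syntax = "proto2";

// File-scope option: protoc then generates value-based
// equals()/hashCode() implementations on the generated Java classes.
option java_generate_equals_and_hash = true;

message Example {
  optional string name = 1;
}
```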

@jkm137 commented Nov 6, 2014

I was hitting this issue, so I cloned and built your changes (https://github.com/rahulrv1980/elephant-bird). With a simple table of AddressBooks (com.twitter.elephantbird.examples.proto.AddressBookProtos$AddressBook), now I'm hitting:

Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:195)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:534)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:177)
... 8 more
Caused by: java.lang.ClassCastException: java.util.ArrayList cannot be cast to com.google.protobuf.Message$Builder
at com.twitter.elephantbird.hive.serde.ProtobufStructObjectInspector.setStructFieldData(ProtobufStructObjectInspector.java:159)
at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters$StructConverter.convert(ObjectInspectorConverters.java:396)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.readRow(MapOperator.java:154)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.access$200(MapOperator.java:127)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:509)
... 9 more

Thoughts?

Thanks in advance!

@CLAassistant commented Jul 18, 2019

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

Rahul Ravindran does not seem to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.

5 participants