Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Respect textinputformat.record.delimiter conf setting in LzoLineRecordReader #86

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

ash211
Copy link
Contributor

@ash211 ash211 commented Jan 3, 2014

@ash211
Copy link
Contributor Author

ash211 commented Jan 3, 2014

I spent quite some time writing a unit test for this, but ended up not having one I was satisfied with.

Following the tests in TestLzoTextInputFormat I used a RecordWriter<Text, Text> with data in the key and empty Text() in the value positions. But when reading from the produced file I ended up with spurious tabs all over the place. What's the proper way to write data with the LZO codec in a unit test in a way that the LzoTextInputFormat can read it without having junk characters inserted into the data?

@ash211
Copy link
Contributor Author

ash211 commented Jan 3, 2014

Weird, that method doesn't exist in your CI's version of Hadoop but does on my CDH4.4.0 -- any suggestions for reconciling the two?

@rangadi
Copy link
Contributor

rangadi commented Jan 3, 2014

The patch looks fine. Regd, unit test, I don't think extra tabs are added by hadoop-lzo anywhere. Can you update the patch with the test you have?

@rangadi
Copy link
Contributor

rangadi commented Jan 3, 2014

What version is CI using? this method exists in all the hadoop versions I think.

@ash211
Copy link
Contributor Author

ash211 commented Jan 3, 2014

You're right that hadoop-lzo doesn't add additional tabs, I mean that the RecordWriter I'm using to create the test file puts a tab between the key and value of the record, which is then picked up by the RecordReader as data when reading it back.

ash211@dcbad20#diff-6e30a0c822cde1296ce246e316c91d59R81 -- write

ash211@dcbad20#diff-6e30a0c822cde1296ce246e316c91d59R104 -- read

@ash211
Copy link
Contributor Author

ash211 commented Jan 3, 2014

I found these lines in the Travis CI logs:

Downloading: https://oss.sonatype.org/content/repositories/releases/org/apache/hadoop/hadoop-core/1.0.4/hadoop-core-1.0.4.pom
Downloading: https://repository.apache.org/releases/org/apache/hadoop/hadoop-core/1.0.4/hadoop-core-1.0.4.pom
Downloading: http://repo.maven.apache.org/maven2/org/apache/hadoop/hadoop-core/1.0.4/hadoop-core-1.0.4.pom
Downloaded: http://repo.maven.apache.org/maven2/org/apache/hadoop/hadoop-core/1.0.4/hadoop-core-1.0.4.pom (5 KB at 39.0 KB/sec)

So looks like 1.0.4

@sjlee
Copy link
Collaborator

sjlee commented Jan 6, 2014

Actually hadoop-lzo is built against two versions of hadoop: 1.0.4 and 2.1.0 (see https://github.com/twitter/hadoop-lzo/blob/master/pom.xml#L91). So the code must build and test cleanly against both versions.

We do have a small compatibility utility where we have to deal with binary- or API-compatibility issues (https://github.com/twitter/hadoop-lzo/blob/master/src/main/java/com/hadoop/compression/lzo/util/CompatibilityUtil.java).

The suggestion is to stick with the APIs that are unchanged in both versions. If you must use APIs that are affected by the incompatibility, then you may need to look at the compatibility util and adopt or augment it.

Hope this helps.

@ash211
Copy link
Contributor Author

ash211 commented Jan 6, 2014

Hi @sjlee ,

From my reading of the 1.0 source -- https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0/src/core/org/apache/hadoop/util/LineReader.java -- it looks like 1.0.4's LineReader doesn't have the method I need, and doesn't have an alternative to use either.

I think the desired behavior I'd like then is if you're running against the 2.0 APIs and have the conf setting set it uses that conf setting, but if you're running against 1.0 APIs and have it set then you get a warning that the conf setting only works in the new APIs. If you're running against either and don't have the conf setting set, things work as before.

Is that the right goal to aim for?

It should be pretty easy to set this up if I can detect which API version is being used and CompatibilityUtil. isVersion2x() looks like exactly what I'd need for that.

@rangadi
Copy link
Contributor

rangadi commented Jan 6, 2014

sounds good. Looks like CDH includes this API, even with hadoop-1x. This is a fairly useful feature. Rather than checking hadoop version, you can check if the constructor exists when the config is set. Otherwise initialize it without delimiter.

@ash211
Copy link
Contributor Author

ash211 commented Jan 7, 2014

You'd suggest doing that check with reflection, checking for the
constructor on the class?

On Mon, Jan 6, 2014 at 2:03 PM, Raghu Angadi [email protected]:

sounds good. Looks like CDH includes this API, even with hadoop-1x. This
is a fairly useful feature. Rather than checking hadoop version, you can
check if the constructor exists when the config is set. Otherwise
initialize it without delimiter.


Reply to this email directly or view it on GitHubhttps://github.com//pull/86#issuecomment-31692194
.

@rangadi
Copy link
Contributor

rangadi commented Jan 7, 2014

Yep, IMHO.

@ash211
Copy link
Contributor Author

ash211 commented Jan 7, 2014

Okay I'll give that a shot and see what I come up with. Might be a day or
two before I get it back here.

On Mon, Jan 6, 2014 at 4:34 PM, Raghu Angadi [email protected]:

Yep, IMHO.


Reply to this email directly or view it on GitHubhttps://github.com//pull/86#issuecomment-31702761
.

@ash211
Copy link
Contributor Author

ash211 commented Jul 8, 2014

So apparently I never quite got back to this! Fixing the test issue is probably something someone else should take on.

@rangadi or @sjlee would you be able to help push this PR over the edge and get it committed?

@CLAassistant
Copy link

CLAassistant commented Jul 18, 2019

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants