Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-1647: [Java][Parquet] Implement FLOAT16 logical type #1142

Merged
merged 1 commit into from
Dec 9, 2023

Conversation

zhangjiashen
Copy link
Contributor

@zhangjiashen zhangjiashen commented Sep 17, 2023

This is to implement logical type Float16.

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
  • Unit Tests
  • E2E Tests

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

@wgtmac wgtmac changed the title [Parquet-1647] Add logical type FLOAT16 PARQUET-1647: Add logical type FLOAT16 Sep 18, 2023
@benibus
Copy link

benibus commented Sep 18, 2023

CI failures are likely due to the fact that the addition of the logical type to parquet-format is unmerged, so the specific PR branch needs to be manually installed for the build to pass. I'm not sure if there's a good solution yet, as this implementation needs to be present for said parquet-format PR to be voted on and merged.

Copy link

@benibus benibus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! This LGTM so far (good work on the conversions), just a few comments.

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @zhangjiashen!

I think there are some follow-up work items:

  • Add a Float16Statistics as suggested by @benibus. Internally it leverages the BINARY_AS_FLOAT16_COMPARATOR to collect min/max values. Also make sure it is well covered in the unit test.
  • Add some e2e tests to make sure statistics are correctly collected by the parquet writer and can be read from the parquet reader.

@wgtmac
Copy link
Member

wgtmac commented Sep 20, 2023

CI failures are likely due to the fact that the addition of the logical type to parquet-format is unmerged, so the specific PR branch needs to be manually installed for the build to pass. I'm not sure if there's a good solution yet, as this implementation needs to be present for said parquet-format PR to be voted on and merged.

I agree that this is not elegant. For now, we can only review the functionality and completeness of the current implementation. Once the format change is submitted (and released), we need to take another pass on this PR.

@zhangjiashen
Copy link
Contributor Author

CI failures are likely due to the fact that the addition of the logical type to parquet-format is unmerged, so the specific PR branch needs to be manually installed for the build to pass. I'm not sure if there's a good solution yet, as this implementation needs to be present for said parquet-format PR to be voted on and merged.

Agree! we can merge PR first after this diff is ready

@zhangjiashen zhangjiashen changed the title PARQUET-1647: Add logical type FLOAT16 PARQUET-1647: [Java][Parquet] Implement logical type FLOAT16 Sep 24, 2023
@zhangjiashen zhangjiashen changed the title PARQUET-1647: [Java][Parquet] Implement logical type FLOAT16 PARQUET-1647: [Java][Parquet] Implement FLOAT16 logical type Sep 24, 2023
@zhangjiashen
Copy link
Contributor Author

@wgtmac @benibus please help revisit this PR once you get a chance?

@wgtmac
Copy link
Member

wgtmac commented Sep 26, 2023

@wgtmac @benibus please help revisit this PR once you get a chance?

I will take a look later this week. cc @shangxinli @gszadovszky

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing the comments quickly! I have left some comments mainly with regard to the issue that we should only use FIXED_LENGTH_BYTE_ARRAY type not BINARY type.

Other than that, we are missing an E2E test case. Please follow what https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetWriterTruncation.java does to write some float16 values in several row groups and check if the statistics from reader are correct.

Copy link
Contributor Author

@zhangjiashen zhangjiashen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wgtmac @benibus @gszadovszky Thanks a lot for reviewing this PR and providing good suggestions, most of them are addressed and could you please help take a look again in case I miss any, should we first merge parquet-format PR and release a lib for this PR if we think this PR is ready? CC: @shangxinli

zhangjiashen

This comment was marked as duplicate.

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhangjiashen zhangjiashen force-pushed the parquet-1647-float16 branch 2 times, most recently from 9c43d6c to d64761e Compare October 15, 2023 04:52
Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @zhangjiashen!

+1

@wgtmac
Copy link
Member

wgtmac commented Oct 27, 2023

apache/parquet-format#184 is merged. Could you try to set parquet.format.version to 2.10.0-SNAPSHOT in the pom.xml and check if the CIs are green?

@gszadovszky
Copy link
Contributor

@wgtmac, I don't think we automatically deploy snapshot versions. And, we will need a final release of parquet-format anyway, before we can get this one merged.

@wgtmac
Copy link
Member

wgtmac commented Oct 28, 2023

@wgtmac, I don't think we automatically deploy snapshot versions. And, we will need a final release of parquet-format anyway, before we can get this one merged.

OK, then let's wait until format v2.10 is released. Once two PoC implementations of apache/parquet-format#197 have been finished, I will kick off the release process.

@wgtmac
Copy link
Member

wgtmac commented Nov 24, 2023

@zhangjiashen This can be rebased to adopt parquet-format 2.10.0

@zhangjiashen
Copy link
Contributor Author

@zhangjiashen This can be rebased to adopt parquet-format 2.10.0

@wgtmac I just rebased with master branch and please help take a look when you get a chance?

@zhangjiashen zhangjiashen force-pushed the parquet-1647-float16 branch 3 times, most recently from 8e77307 to 323899e Compare November 29, 2023 04:20
@wgtmac
Copy link
Member

wgtmac commented Nov 30, 2023

Could you please rebase it?

@zhangjiashen
Copy link
Contributor Author

Could you please rebase it?

Rebased, can you help merge this PR?

pom.xml Outdated Show resolved Hide resolved
Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Thanks @zhangjiashen for working on this!

@wgtmac
Copy link
Member

wgtmac commented Dec 4, 2023

BTW, it would be good to add an interoperability test (as a follow-up PR) to read parquet files from here: apache/parquet-testing@da467da. You may want to take a look at this example: https://github.com/apache/parquet-mr/blob/44b56225be6fe7b74667f4f2430326ef1f076cc5/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/codec/TestInteropReadLz4RawCodec.java#L40

@wgtmac
Copy link
Member

wgtmac commented Dec 5, 2023

Will merge this by the end of this week if there is no objection.

Copy link

@benibus benibus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @zhangjiashen for your persistence on this! In case I didn't already make this clear, there were a lot of eyes on this project and your contributions were crucial to its success - so thank you for stepping up here.

@wgtmac wgtmac merged commit c061d54 into apache:master Dec 9, 2023
9 checks passed
@wgtmac
Copy link
Member

wgtmac commented Dec 9, 2023

Merged this. Thanks @zhangjiashen and everyone!

@pitrou
Copy link
Member

pitrou commented Dec 9, 2023

Is it planned to add the interoperability test mentioned in #1142 (comment) ?

@wgtmac
Copy link
Member

wgtmac commented Dec 9, 2023

Is it planned to add the interoperability test mentioned in #1142 (comment) ?

I think so. @pitrou

@zhangjiashen
Copy link
Contributor Author

Is it planned to add the interoperability test mentioned in #1142 (comment) ?

I plan to add some interoperability tests.

@zhangjiashen
Copy link
Contributor Author

Is it planned to add the interoperability test mentioned in #1142 (comment) ?

I think so. @pitrou

Added interOp tests in #1235

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants