PARQUET-535: Make writeAllFields more efficient #324

Open
wants to merge 1 commit into
base: master

Conversation

garlicbulbxian

This improves write performance by roughly 1/4 to 1/3 in my testing, which makes a big difference for really large files (several GB).

In the original implementation, a significant amount of time was spent creating unnecessary Java objects.
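
To illustrate the kind of per-row allocation this refers to, here is a minimal sketch. It uses a stand-in model, not the protobuf or Parquet API, and the patch itself is not reproduced here. `Msg`, `allFields`, `writeViaMap`, and `writeViaSchema` are hypothetical names. In protobuf-java the rough counterparts would be `Message.getAllFields()` for the map-based walk and `Descriptor.getFields()` with `hasField(...)` for the schema-based walk; that the patch does exactly this is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in model for illustration only; none of these types are protobuf or Parquet classes.
final class WriteAllFieldsSketch {

    /** Hypothetical message: a schema shared by all rows plus one value slot per field. */
    static final class Msg {
        final String[] fieldNames; // schema, built once and reused for every row
        final Object[] values;     // per-row values; null means "field not set"

        Msg(String[] fieldNames, Object[] values) {
            this.fieldNames = fieldNames;
            this.values = values;
        }

        /** Mimics an "all set fields" accessor: allocates a new map and its entries on each call. */
        Map<String, Object> allFields() {
            Map<String, Object> set = new LinkedHashMap<>();
            for (int i = 0; i < fieldNames.length; i++) {
                if (values[i] != null) {
                    set.put(fieldNames[i], values[i]);
                }
            }
            return set;
        }
    }

    /** Map-based walk: one temporary map per row, discarded right after the loop. */
    static List<String> writeViaMap(Msg msg) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Object> e : msg.allFields().entrySet()) {
            out.add(e.getKey() + "=" + e.getValue());
        }
        return out;
    }

    /** Schema-based walk: iterates the shared field list by index, with no temporary map. */
    static List<String> writeViaSchema(Msg msg) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < msg.fieldNames.length; i++) {
            Object value = msg.values[i];
            if (value != null) {
                out.add(msg.fieldNames[i] + "=" + value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Msg row = new Msg(new String[] {"id", "name", "email"}, new Object[] {7, null, "a@b.c"});
        System.out.println(writeViaMap(row));
        System.out.println(writeViaSchema(row));
    }
}
```

Both walks emit the same fields in the same order; the difference is only that the second one skips the temporary map, which adds up over millions of rows.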

@lw-lin
Contributor

lw-lin commented Feb 14, 2016

Thanks for the patch @garlicccbulb! I'll look into this and run some tests.

Could you add a Parquet JIRA issue for this and add it to this PR's summary?
We only merge PRs that start like "PARQUET-NNN: Make writeAllFields more efficient".

@garlicbulbxian garlicbulbxian changed the title Make writeAllFields more efficient PARQUET-535: Make writeAllFields more efficient Feb 16, 2016
@garlicbulbxian
Author

Thanks @proflin. I created the JIRA issue in the PARQUET queue: https://issues.apache.org/jira/browse/PARQUET-535.

Let me know if there is anything else to do.

@garlicbulb-puzhuo

Can someone take a quick look at this?

@julienledem
Member

This looks good to me.
@lukasnalezenec: comments?

@julienledem
Member

@lw-lin @lukasnalezenec: any more comments?

@garlicbulb-puzhuo garlicbulb-puzhuo force-pushed the ProtoWriteSupport branch 2 times, most recently from 6038af3 to 582af43 Compare June 19, 2018 04:40
@garlicbulb-puzhuo

@julienledem @lw-lin @lukasnalezenec Sorry this one has been delayed for a while. I updated the PR to work with the new semantics for protobuf extensions. In a nutshell, there isn't much that can be done for extensions, for the following reasons.

  • Whether a given extendable protobuf schema actually has extension fields can only be checked at runtime.
  • The public protobuf API for checking this requires full deserialization and creation of the wrapper maps.

However, non-extendable protobuf schemas still benefit from the improvement, which matters for large files (millions of rows).
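
The trade-off above can be sketched as a dispatch between two paths. This is a stand-in model, not the protobuf or Parquet API, and not the code in the PR. `Schema`, `Row`, `pathFor`, and `write` are hypothetical names. In protobuf-java the flag would presumably come from `Descriptor.isExtendable()`; that the patch dispatches this way is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in model for illustration only; none of these types are protobuf or Parquet classes.
final class ExtensionFallbackSketch {

    /** Hypothetical schema; the extendable flag is fixed when the schema is defined. */
    static final class Schema {
        final boolean extendable;
        final String[] fieldNames;

        Schema(boolean extendable, String... fieldNames) {
            this.extendable = extendable;
            this.fieldNames = fieldNames;
        }
    }

    /** Hypothetical row: declared values by index, plus extension values known only at runtime. */
    static final class Row {
        final Schema schema;
        final Object[] values;
        final Map<String, Object> extensions;

        Row(Schema schema, Object[] values, Map<String, Object> extensions) {
            this.schema = schema;
            this.values = values;
            this.extensions = extensions;
        }
    }

    /** The choice depends only on the schema, so it can be made once rather than per row. */
    static String pathFor(Schema schema) {
        return schema.extendable ? "map" : "index";
    }

    static List<String> write(Row row) {
        List<String> out = new ArrayList<>();
        if (row.schema.extendable) {
            // General path: gather declared and extension fields into one temporary map.
            Map<String, Object> all = new LinkedHashMap<>();
            for (int i = 0; i < row.schema.fieldNames.length; i++) {
                if (row.values[i] != null) {
                    all.put(row.schema.fieldNames[i], row.values[i]);
                }
            }
            all.putAll(row.extensions);
            for (Map.Entry<String, Object> e : all.entrySet()) {
                out.add(e.getKey() + "=" + e.getValue());
            }
        } else {
            // Fast path: no extensions are possible, so walk the declared fields by index.
            for (int i = 0; i < row.schema.fieldNames.length; i++) {
                if (row.values[i] != null) {
                    out.add(row.schema.fieldNames[i] + "=" + row.values[i]);
                }
            }
        }
        return out;
    }
}
```

Only the non-extendable branch avoids the temporary map, which matches the point that extendable schemas gain little here.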

@garlicbulb-puzhuo

@julienledem @lw-lin @lukasnalezenec Any thoughts?
