Support for GDPR erase requirements #341

Open
asfimport opened this issue Nov 6, 2017 · 4 comments

Comments

@asfimport
Collaborator

As I understand it, Parquet is write-once, so mutating data inside Parquet files is not an option. A new EU-wide law takes effect in May 2018 that requires companies to delete data pertaining to a customer when asked to do so.

Our case is quite simple: our biggest Parquet tables collect 7.5 billion rows a month, so removing data by duplicating such a table while filtering out the unwanted customer data is not feasible.

Perhaps there is some way to remove particular data? Or perhaps there is an efficient way to do read/filter/write? Perhaps zeroing out the data in place is an option, since that would not change the layout of the files.

I'm not sure if this is the right platform to start this discussion, but I think more people will run into this issue once it becomes clear that data needs to be deleted everywhere, including in Parquet files. Companies face multimillion-dollar fines if they don't comply with GDPR.

Reporter: Machiel Groeneveld

Note: This issue was originally created as PARQUET-1155. Please see the migration documentation for further details.

@asfimport
Collaborator Author

Atri Sharma / @atris:
Is this issue being actively worked on? I would like to work on it if it's open.

@asfimport
Collaborator Author

Kai Liu:
Hi, Machiel,
Did you get any traction on this issue? And would you mind sharing the approach you are taking to address the GDPR requirement in your system now?

Kai Liu

@asfimport
Collaborator Author

Machiel Groeneveld:
Hi [~zjumad], there is no news from the Parquet side. A recent development in the community that deals with this problem is Delta Lake. It adds a layer on top of Parquet to allow deletions, although the Parquet files themselves are still read-only.

@asfimport
Collaborator Author

Fokko Driesprong / @Fokko:
I don't see this being implemented in Apache Parquet. However, Delta Lake and Apache Iceberg can both solve the issue that you're describing. With Delta, make sure that you vacuum, otherwise the data will still be on the disks :-)
