Core: Move deleted files to Hadoop trash if configured #14501
Conversation
b0ba8b9 to 8d07d49 (force-push)
8d07d49 to 5cb16cf (force-push)
anuragmantri left a comment:
Thanks for the PR @jordepic. This is very useful.
However, it seems like a behavior change (even if trash was enabled previously, Iceberg was not honoring it). IMO, we should make this configurable using a property to avoid surprises (unexpected storage consumption).
private void deletePath(FileSystem fs, Path toDelete, boolean recursive) throws IOException {
  Trash trash = new Trash(fs, getConf());
I'm concerned about the number of Trash objects we create. Does the Hadoop API ensure we can reuse the Trash object for a given (fs, conf)?
I couldn't tell from https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/Trash.html#Trash-org.apache.hadoop.fs.FileSystem-org.apache.hadoop.conf.Configuration-
That's a good call. I've added a new toggle that can be set in the Hadoop configuration to determine whether Iceberg should use the trash, following Russel Spitzer's example in other HadoopFileIO changes.
I've also looked into object reuse. The trash behavior depends on many configuration values (I'd have to build a cache keyed on 5+ configuration values, which may change in the future), unlike the FileSystem cache, whose key doesn't depend on the Configuration at all, only on the URI and the user group information. That said, the change to check the Hadoop configuration first means we don't create a Trash object unless the user has specifically opted in, so any possible object churn only affects users who enable this feature. I hope that's good enough for now.
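For illustration, a minimal sketch of what an opt-in delete path could look like. The property name `iceberg.hadoop.use-trash` and the class/method names are assumptions for this example, not the PR's actual code; the Hadoop calls (`new Trash(fs, conf)`, `Trash#moveToTrash`, `FileSystem#delete`) are real APIs.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

class TrashAwareDelete {
  // Hypothetical property key; the PR may use a different name.
  private static final String USE_TRASH = "iceberg.hadoop.use-trash";

  static void deletePath(FileSystem fs, Path toDelete, boolean recursive, Configuration conf)
      throws IOException {
    if (conf.getBoolean(USE_TRASH, false)) {
      // Only construct the Trash object when the user has opted in,
      // so callers that keep the default see no extra object churn.
      Trash trash = new Trash(fs, conf);
      if (!trash.moveToTrash(toDelete)) {
        throw new IOException("Failed to move " + toDelete + " to trash");
      }
    } else {
      // Existing behavior: plain FileSystem delete, which bypasses any trash directory.
      if (!fs.delete(toDelete, recursive)) {
        throw new IOException("Failed to delete " + toDelete);
      }
    }
  }
}
```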
5cb16cf to efc6a8f (force-push)
As of now, HadoopFileIO uses the Java delete API, which always bypasses any configured trash directory. If the table's Hadoop configuration has trash enabled, we should use it. For single-file deletions, we aim to mimic existing behavior by throwing an error when attempting to delete a directory. Note that when trash is enabled, this adds one RPC call to the single-file delete path (to check whether the path is a directory).
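As a sketch of that single-file delete path (with assumed helper and message text, not the PR's exact code), the extra RPC is the `getFileStatus` call used to reject directories before the file is moved to trash:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

class SingleFileTrashDelete {
  static void deleteFile(FileSystem fs, Path path, Configuration conf) throws IOException {
    // Extra RPC, needed only when trash is enabled: preserve the existing
    // behavior of refusing to delete a directory through the single-file API.
    if (fs.getFileStatus(path).isDirectory()) {
      throw new IOException("Cannot delete directory via single-file delete: " + path);
    }
    Trash trash = new Trash(fs, conf);
    if (!trash.moveToTrash(path)) {
      throw new IOException("Failed to move " + path + " to trash");
    }
  }
}
```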