
Conversation

@jordepic jordepic commented Nov 4, 2025

As of now, HadoopFileIO deletes paths through FileSystem.delete(), which always bypasses a configured trash directory. If the table's Hadoop configuration has trash enabled, we should use it. For single-file deletions, we aim to mimic existing behavior by throwing an error when attempting to delete a directory; notably, this adds an RPC call (to check whether the path is a directory) when trash is enabled.
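
For context, a minimal sketch of what such a trash-aware delete could look like, built on Hadoop's org.apache.hadoop.fs.Trash API (the control flow here is an illustration of the description above, not necessarily the PR's exact code):

```java
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

// Assumes an enclosing Configurable class (like HadoopFileIO) providing getConf().
private void deletePath(FileSystem fs, Path toDelete, boolean recursive) throws IOException {
  Trash trash = new Trash(fs, getConf());
  if (trash.isEnabled()) {
    // The extra RPC mentioned above: for single-file deletes, mimic the old
    // behavior of failing when the target is a directory.
    if (!recursive && fs.getFileStatus(toDelete).isDirectory()) {
      throw new IOException("Cannot delete directory (recursive=false): " + toDelete);
    }
    trash.moveToTrash(toDelete); // renames the path into the user's trash directory
  } else {
    fs.delete(toDelete, recursive); // old behavior: bypasses trash entirely
  }
}
```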

@github-actions github-actions bot added the core label Nov 4, 2025
@jordepic jordepic force-pushed the HADOOP_FILE_IO_CHANGE branch from b0ba8b9 to 8d07d49 on November 4, 2025 16:50
@jordepic jordepic force-pushed the HADOOP_FILE_IO_CHANGE branch from 8d07d49 to 5cb16cf on November 5, 2025 16:14
@anuragmantri (Contributor) left a comment:

Thanks for the PR @jordepic. This is very useful.

However, it seems like a behavior change (even if trash was enabled previously, Iceberg was not honoring it). IMO, we should make this configurable using a property to avoid surprises (unexpected storage consumption).
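
As an illustration, such an opt-in could be set on the table's Hadoop configuration roughly like this (the key name below is hypothetical, not an existing Iceberg property):

```java
import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Hypothetical key; defaulting to false would keep today's behavior for existing jobs.
conf.setBoolean("iceberg.hadoop.trash-enabled", true);
```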

}

private void deletePath(FileSystem fs, Path toDelete, boolean recursive) throws IOException {
  Trash trash = new Trash(fs, getConf());
@anuragmantri (Contributor) commented:

I'm concerned about the number of Trash objects we create. Does the Hadoop API ensure we can reuse the Trash object for a given (fs, conf)?
I couldn't tell from https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/Trash.html#Trash-org.apache.hadoop.fs.FileSystem-org.apache.hadoop.conf.Configuration-

@jordepic (Author) replied:

Good call. I've added a new toggle that can be set in the Hadoop configuration to determine whether Iceberg should use the trash, following Russell Spitzer's example in other HadoopFileIO changes.

I've looked into object reuse. The Trash can change due to many configuration changes (meaning I'd have to build a cache keyed on 5+ configuration values, all of which are susceptible to change in the future), unlike the FileSystem, whose cache key doesn't rely on the conf at all, only on the URI and the UserGroupInformation. That said, the change I made to check the Hadoop configuration first means we don't create a Trash object unless the user specifically opts in. I hope that's good enough for now: an Iceberg user now has to opt into this change to experience any possible object churn.
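
A sketch of that gating, reusing the hypothetical key from the illustration above; the point is that the Trash object is only constructed once a user has opted in:

```java
private void deletePath(FileSystem fs, Path toDelete, boolean recursive) throws IOException {
  // Hypothetical opt-in key; false by default, so no Trash objects are created
  // (and no extra RPCs issued) unless trash behavior was explicitly requested.
  if (getConf().getBoolean("iceberg.hadoop.trash-enabled", false)) {
    Trash trash = new Trash(fs, getConf());
    // (Directory check from the earlier sketch omitted for brevity.)
    if (trash.isEnabled() && trash.moveToTrash(toDelete)) {
      return; // moved into the trash; nothing left to delete
    }
  }
  if (!fs.delete(toDelete, recursive) && fs.exists(toDelete)) {
    throw new IOException("Failed to delete: " + toDelete);
  }
}
```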

@jordepic jordepic force-pushed the HADOOP_FILE_IO_CHANGE branch from 5cb16cf to efc6a8f on November 6, 2025 15:25
@danielcweeks danielcweeks self-requested a review November 10, 2025 19:24
