fix: compatible to write to local file systems that do not support hard link #1868
Conversation
Thanks for looking into this @RobinLin666.
AFAIK, a regular rename will replace the original file if it already exists. As such, it is not safe to use for delta commits.
For S3 lock stores we have a setting, ALLOW_UNSAFE_RENAME, to allow writing to S3 without a lock client. I would be fine adding this if we also require that setting to be set, such that users are aware they themselves have to make sure there are no concurrent writers to the table.
Thanks for your suggestion, @roeap. I've added a new option. Looking forward to your insights.
Certainly ... The basic idea to make commits atomic is to write a temporary file, rename it to the desired commit file, and rely on the file system to error if the file already exists. So if we just use a regular rename, concurrent writers might overwrite each other's commits.
Yes, if the target already exists, rename will replace it. But in this scenario we require that the target file does not exist; if it does, an error is reported directly and the rename operation is not performed, so I think it is still atomic.
Well, we went over this several times, also with the core team, and the general consensus was that an a priori check without any lock mechanism would not prevent race conditions. This is especially true if the file system has significant latencies, as remote stores do. I think you might be able to validate this by adopting the S3 concurrency tests.
Yes! You are right!
Hi @roeap, do you have any other concerns? I can modify them at any time. Please feel free to review, thanks!
@RobinLin666 what's the impact if multiple people write at the same time to a mounted storage?
Hi @ion-elgreco, writing to different directories from different instances will work if you set --use-attr-cache=false and --file-cache-timeout-in-seconds=0, but you cannot write to the same blob/file at the same time. Here is some detail: Azure/azure-storage-fuse#366.
Kindly ping.
Hi team, just wanted to kindly bump this PR for review. Thank you!
It would be really nice to have this functionality, as it is very useful in Fabric notebooks, please :)
Since delta-rs has been refactored, I have also rearranged the code here. Please review it again. Thank you! Note: to completely fix this problem, arrow-rs object_store needs to be upgraded to version 0.9. I wrote a delta table to a mounted path in my local test and it passed.
Hi @MrPowers, could you please help review the PR, since you refined the code last week? Thanks!
Hi @roeap / @wjones127 / @rtyler / @fvaleye / @ion-elgreco / @MrPowers, could you please help review the PR? It would be really nice to have this functionality, as it is very useful in Fabric notebooks and in Databricks notebooks (dbfs://), please :)
Hi guys, we are preparing a new Fabric Notebook with a pure Python environment, so we want to depend on this library to give users an enjoyable experience operating Lakehouse tables, whether through an abfss path or a local mounted path.
This would be a really useful PR for other file systems like HDFS.
Excuse me, hi @roeap / @wjones127 / @rtyler / @fvaleye / @ion-elgreco / @MrPowers, could you please help review the PR? Thank you very much!
Nice to see your reply. I just updated it. Thank you!
Hi @ion-elgreco, have a good day!
Hey @RobinLin666 @ion-elgreco, sorry for being MIA for a while. Since blob-fuse is a valid use case, I guess we have to do something about it. Generally I was hoping to get rid of all custom file system implementations in delta-rs, but it seems we will have to retain something after all. One thing though - could we make the config handling analogous to how we handle configuration in object store and also our internal config - e.g.
Hi @roeap, thank you for your suggestion. I wrote it with reference to the AWS implementation, supporting passing the configuration via environment variables or parameters.
Hi @roeap ?
Hello @roeap, is there any concern?
Hi @roeap, we greatly rely on this PR, as many users strongly demand the ability to write data to the mount point, not only with blobfuse but also potentially with tools like s3fs, Databricks' dbfs, and others. If you believe the code needs refactoring, could you provide more detailed guidance on how it should be refactored? Alternatively, it would be ideal if we could proceed with this PR first, and then I could prepare a refactoring PR. I prefer not to privately compile a delta-rs installation for our product, as it would be cumbersome to maintain and not practical. Thank you.
@RobinLin666 I think what Robert means is to make this akin to crates/azure/src/config.rs, in parsing and setting configuration. Does that help?
Hi @roeap / @ion-elgreco, I have refined the code, please help to review, thanks!
@RobinLin666 Looks good! Can you fix the CI failures?
Hi @roeap / @ion-elgreco, can you help to review again?
@RobinLin666 I don't see a test in Python anymore; could you add one? Once that's there, we are good to go.
Thanks @ion-elgreco, done.
@RobinLin666 thanks!
Hi @ion-elgreco, thank you very much for your approval!!!
Compatible with writing to local file systems that do not support hard links.
Description
When we write to the local file system, hard links are sometimes not supported (for example by blobfuse, goofys, and s3fs), so handle this case for compatibility.
It is important to note that:
There is another problem with blobfuse: rename reports errors, because the file handle is not released before the rename.
See here for details: #1765
arrow-rs requires a corresponding modification, for example: https://github.com/GlareDB/arrow-rs/pull/2/files
Because object_store has been upgraded to 0.8 with a lot of breaking changes, I haven't made this change for the time being. Will fix it after upgrading to 0.8 #1858
Related Issue(s)
#1765
#1376
Documentation