-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-38772: [C++] Implement directory semantics even when the storage account doesn't support HNS #39361
Conversation
…n flat ns accounts
This will fit better together with another boolean I'm adding to DeleteDirContentsOnContainer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have not reviewed everything yet. I have a few comments, but generally I think it looks great so far.
cpp/src/arrow/filesystem/azurefs.cc
Outdated
auto dir_marker_blob_path = internal::EnsureTrailingSlash(location.path); | ||
return container_client.GetBlobClient(dir_marker_blob_path).AsBlockBlobClient(); | ||
}, | ||
[](auto& block_blob_client) { block_blob_client.UploadFrom(nullptr, 0); }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be a good idea to put some metadata on all directory marker blobs we create. That would allow us to add a simple checks in ObjectInputFile::Init
and ObjectAppendStream::Init
to ensure they are not being initialised as directories.
Additionally I think this will not be compatible with adlfs
if it does not add metadata to directory markers. I think its very likely that people will end up using a combination of this arrow filesystem and adlfs
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I would like to be as compatible as possible with existing specs. But to clarify, what do you mean when you say adlfs
? Don't you mean abfs
instead [1]?
I think there is a confusion between abfs
(filesystem semantics over the Blobs API) and adlfs
(the separate service that Azure created that can also be accessed via the Blobs API with some restrictions).
The fsspec
repo seems to be linking both abfs
and adlfs
to the same thing, but they shouldn't be considered the same thing: https://github.com/fsspec/filesystem_spec/blob/0ffe06cb767456b7c13904b57ec1c3ca60d53eae/docs/source/api.rst?plain=1#L229
Tell me what I'm missing here because my experience with Azure is restricted just to reading the docs (a lot of them) recently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, in this case I mean the adlfs pip repo. It's the Azure fsspec implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More specifically I think we should add some metadata which will be compatible such that https://github.com/fsspec/adlfs/blob/32132c4094350fca2680155a5c236f2e9f991ba5/adlfs/spec.py#L855-L870 can correctly detect the directories.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I forgot this, but I will add this change now.
@Tom-Newton @kou I pushed changes based on your feedback. I would love to merge it before tomorrow's release cut. |
Could you check CI failures? https://github.com/apache/arrow/actions/runs/7414494286/job/20175680211?pr=39361#step:6:3814
https://github.com/apache/arrow/actions/runs/7414494286/job/20175680211?pr=39361#step:6:3680
|
@kou done. These turned out to be Azurite-only tests that needed updating since the semantics are now different than what was assumed when the tests were written. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit aae6fa4. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
…rage account doesn't support HNS (apache#39361) ### Rationale for this change The `FileSystem` implementation based on Azure Blob Storage should implement directory operations according to filesystem semantics. When Hierarchical Namespace (HNS) is enabled, we can rely on Azure Data Lake Storage Gen 2 APIs implementing the filesystem semantics for us, but when all we have is the Blobs API, we should emulate it. ### What changes are included in this PR? - Skip fewer tests - Re-implement `GetFileInfo` using `ListBlobsByHierarchy` instead of `ListBlobs` - Re-implement `CreateDir` with an upfront HNS support check instead of falling back to Blobs API after an error - Add comprehensive tests to `CreateDir` - Add `HasSubmitBatchBug` to check if a test inside any scenario is affected by a certain Azurite issue - Implement `DeleteDir` to work properly on flat namespace storage accounts (non-HNS accounts) - ### Are these changes tested? Yes. By existing and new tests added by this PR itself. * Closes: apache#38772 Authored-by: Felipe Oliveira Carvalho <[email protected]> Signed-off-by: Felipe Oliveira Carvalho <[email protected]>
…rage account doesn't support HNS (apache#39361) ### Rationale for this change The `FileSystem` implementation based on Azure Blob Storage should implement directory operations according to filesystem semantics. When Hierarchical Namespace (HNS) is enabled, we can rely on Azure Data Lake Storage Gen 2 APIs implementing the filesystem semantics for us, but when all we have is the Blobs API, we should emulate it. ### What changes are included in this PR? - Skip fewer tests - Re-implement `GetFileInfo` using `ListBlobsByHierarchy` instead of `ListBlobs` - Re-implement `CreateDir` with an upfront HNS support check instead of falling back to Blobs API after an error - Add comprehensive tests to `CreateDir` - Add `HasSubmitBatchBug` to check if a test inside any scenario is affected by a certain Azurite issue - Implement `DeleteDir` to work properly on flat namespace storage accounts (non-HNS accounts) - ### Are these changes tested? Yes. By existing and new tests added by this PR itself. * Closes: apache#38772 Authored-by: Felipe Oliveira Carvalho <[email protected]> Signed-off-by: Felipe Oliveira Carvalho <[email protected]>
…rage account doesn't support HNS (apache#39361) ### Rationale for this change The `FileSystem` implementation based on Azure Blob Storage should implement directory operations according to filesystem semantics. When Hierarchical Namespace (HNS) is enabled, we can rely on Azure Data Lake Storage Gen 2 APIs implementing the filesystem semantics for us, but when all we have is the Blobs API, we should emulate it. ### What changes are included in this PR? - Skip fewer tests - Re-implement `GetFileInfo` using `ListBlobsByHierarchy` instead of `ListBlobs` - Re-implement `CreateDir` with an upfront HNS support check instead of falling back to Blobs API after an error - Add comprehensive tests to `CreateDir` - Add `HasSubmitBatchBug` to check if a test inside any scenario is affected by a certain Azurite issue - Implement `DeleteDir` to work properly on flat namespace storage accounts (non-HNS accounts) - ### Are these changes tested? Yes. By existing and new tests added by this PR itself. * Closes: apache#38772 Authored-by: Felipe Oliveira Carvalho <[email protected]> Signed-off-by: Felipe Oliveira Carvalho <[email protected]>
Rationale for this change
The
FileSystem
implementation based on Azure Blob Storage should implement directory operations according to filesystem semantics. When Hierarchical Namespace (HNS) is enabled, we can rely on Azure Data Lake Storage Gen 2 APIs implementing the filesystem semantics for us, but when all we have is the Blobs API, we should emulate it.What changes are included in this PR?
GetFileInfo
usingListBlobsByHierarchy
instead ofListBlobs
CreateDir
with an upfront HNS support check instead of falling back to Blobs API after an errorCreateDir
HasSubmitBatchBug
to check if a test inside any scenario is affected by a certain Azurite issueDeleteDir
to work properly on flat namespace storage accounts (non-HNS accounts)Are these changes tested?
Yes. By existing and new tests added by this PR itself.
GetFileInfo()
against a directory always return true without hierarchical namespace support? #38772