Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-38772: [C++] Implement directory semantics even when the storage account doesn't support HNS #39361

Merged
merged 24 commits into from
Jan 5, 2024

Conversation

felipecrv
Copy link
Contributor

@felipecrv felipecrv commented Dec 24, 2023

Rationale for this change

The FileSystem implementation based on Azure Blob Storage should implement directory operations according to filesystem semantics. When Hierarchical Namespace (HNS) is enabled, we can rely on Azure Data Lake Storage Gen 2 APIs implementing the filesystem semantics for us, but when all we have is the Blobs API, we should emulate it.

What changes are included in this PR?

  • Skip fewer tests
  • Re-implement GetFileInfo using ListBlobsByHierarchy instead of ListBlobs
  • Re-implement CreateDir with an upfront HNS support check instead of falling back to Blobs API after an error
  • Add comprehensive tests to CreateDir
  • Add HasSubmitBatchBug to check if a test inside any scenario is affected by a certain Azurite issue
  • Implement DeleteDir to work properly on flat namespace storage accounts (non-HNS accounts)

Are these changes tested?

Yes. By existing and new tests added by this PR itself.

@felipecrv
Copy link
Contributor Author

@Tom-Newton

Copy link
Contributor

@Tom-Newton Tom-Newton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not reviewed everything yet. I have a few comments, but generally I think it looks great so far.

cpp/src/arrow/filesystem/azurefs.cc Show resolved Hide resolved
cpp/src/arrow/filesystem/azurefs.cc Outdated Show resolved Hide resolved
auto dir_marker_blob_path = internal::EnsureTrailingSlash(location.path);
return container_client.GetBlobClient(dir_marker_blob_path).AsBlockBlobClient();
},
[](auto& block_blob_client) { block_blob_client.UploadFrom(nullptr, 0); },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be a good idea to put some metadata on all directory marker blobs we create. That would allow us to add a simple checks in ObjectInputFile::Init and ObjectAppendStream::Init to ensure they are not being initialised as directories.

Additionally I think this will not be compatible with adlfs if it does not add metadata to directory markers. I think its very likely that people will end up using a combination of this arrow filesystem and adlfs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I would like to be as compatible as possible with existing specs. But to clarify, what do you mean when you say adlfs? Don't you mean abfs instead [1]?

[1] https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/AzureBlobFileSystem.java

I think there is a confusion between abfs (filesystem semantics over the Blobs API) and adlfs (the separate service that Azure created that can also be accessed via the Blobs API with some restrictions).

The fsspec repo seems to be linking both abfs and adlfs to the same thing, but they shouldn't be considered the same thing: https://github.com/fsspec/filesystem_spec/blob/0ffe06cb767456b7c13904b57ec1c3ca60d53eae/docs/source/api.rst?plain=1#L229

Tell me what I'm missing here because my experience with Azure is restricted just to reading the docs (a lot of them) recently.

Copy link
Contributor

@Tom-Newton Tom-Newton Dec 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, in this case I mean the adlfs pip repo. It's the Azure fsspec implementation.

Copy link
Contributor

@Tom-Newton Tom-Newton Jan 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More specifically I think we should add some metadata which will be compatible such that https://github.com/fsspec/adlfs/blob/32132c4094350fca2680155a5c236f2e9f991ba5/adlfs/spec.py#L855-L870 can correctly detect the directories.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot this, but I will add this change now.

cpp/src/arrow/filesystem/azurefs_test.cc Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Dec 24, 2023
cpp/src/arrow/filesystem/azurefs.cc Show resolved Hide resolved
cpp/src/arrow/filesystem/azurefs.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/azurefs.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/azurefs.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/azurefs.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/azurefs.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/azurefs.cc Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Jan 1, 2024
@felipecrv
Copy link
Contributor Author

@Tom-Newton @kou I pushed changes based on your feedback. I would love to merge it before tomorrow's release cut.

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jan 4, 2024
@kou
Copy link
Member

kou commented Jan 4, 2024

Could you check CI failures?

https://github.com/apache/arrow/actions/runs/7414494286/job/20175680211?pr=39361#step:6:3814

[ RUN      ] TestAzuriteFileSystem.DeleteDirContentsSuccessDirectory
/arrow/cpp/src/arrow/filesystem/test_util.cc:143: Failure
Expected equality of these values:
  info.type()
    Which is: FileType::Directory
  type
    Which is: FileType::NotFound
For path 'hfqilxw9uhk7xqbathiwbudo4ci9dp9w/0iav1y3m4u53d5wvtp8g2j5af1o4s694'

https://github.com/apache/arrow/actions/runs/7414494286/job/20175680211?pr=39361#step:6:3680

[ RUN      ] TestAzuriteFileSystem.DeleteDirSuccessNonexistent
/arrow/cpp/src/arrow/filesystem/azurefs_test.cc:1301: Failure
Failed
'fs()->DeleteDir(directory_path)' failed with IOError: Path does not exist 'go9qhglaypjhxnprrxc79a0e3xtceml1/bwumjdfql0n8hizvv1kd3jn75ucrlw4u'. Detail: [errno 2] No such file or directory

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Jan 4, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jan 4, 2024
@felipecrv
Copy link
Contributor Author

@kou done. These turned out to be Azurite-only tests that needed updating since the semantics are now different than what was assumed when the tests were written.

@felipecrv felipecrv requested a review from kou January 4, 2024 22:27
@felipecrv felipecrv added this to the 15.0.0 milestone Jan 4, 2024
@felipecrv felipecrv requested a review from Tom-Newton January 5, 2024 01:36
Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Jan 5, 2024
@felipecrv felipecrv merged commit aae6fa4 into apache:main Jan 5, 2024
33 checks passed
@felipecrv felipecrv removed the awaiting merge Awaiting merge label Jan 5, 2024
@felipecrv felipecrv deleted the azure_dir_semantics branch January 5, 2024 15:44
Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit aae6fa4.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

clayburn pushed a commit to clayburn/arrow that referenced this pull request Jan 23, 2024
…rage account doesn't support HNS (apache#39361)

### Rationale for this change

The `FileSystem` implementation based on Azure Blob Storage should implement directory operations according to filesystem semantics. When Hierarchical Namespace (HNS) is enabled, we can rely on Azure Data Lake Storage Gen 2 APIs implementing the filesystem semantics for us, but when all we have is the Blobs API, we should emulate it.

### What changes are included in this PR?

 - Skip fewer tests
 - Re-implement `GetFileInfo` using `ListBlobsByHierarchy` instead of `ListBlobs`
 - Re-implement `CreateDir` with an upfront HNS support check instead of falling back to Blobs API after an error
 - Add comprehensive tests to `CreateDir`
 - Add `HasSubmitBatchBug` to check if a test inside any scenario is affected by a certain Azurite issue
 - Implement `DeleteDir` to work properly on flat namespace storage accounts (non-HNS accounts)
 - 

### Are these changes tested?

Yes. By existing and new tests added by this PR itself.
* Closes: apache#38772

Authored-by: Felipe Oliveira Carvalho <[email protected]>
Signed-off-by: Felipe Oliveira Carvalho <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…rage account doesn't support HNS (apache#39361)

### Rationale for this change

The `FileSystem` implementation based on Azure Blob Storage should implement directory operations according to filesystem semantics. When Hierarchical Namespace (HNS) is enabled, we can rely on Azure Data Lake Storage Gen 2 APIs implementing the filesystem semantics for us, but when all we have is the Blobs API, we should emulate it.

### What changes are included in this PR?

 - Skip fewer tests
 - Re-implement `GetFileInfo` using `ListBlobsByHierarchy` instead of `ListBlobs`
 - Re-implement `CreateDir` with an upfront HNS support check instead of falling back to Blobs API after an error
 - Add comprehensive tests to `CreateDir`
 - Add `HasSubmitBatchBug` to check if a test inside any scenario is affected by a certain Azurite issue
 - Implement `DeleteDir` to work properly on flat namespace storage accounts (non-HNS accounts)
 - 

### Are these changes tested?

Yes. By existing and new tests added by this PR itself.
* Closes: apache#38772

Authored-by: Felipe Oliveira Carvalho <[email protected]>
Signed-off-by: Felipe Oliveira Carvalho <[email protected]>
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Feb 28, 2024
…rage account doesn't support HNS (apache#39361)

### Rationale for this change

The `FileSystem` implementation based on Azure Blob Storage should implement directory operations according to filesystem semantics. When Hierarchical Namespace (HNS) is enabled, we can rely on Azure Data Lake Storage Gen 2 APIs implementing the filesystem semantics for us, but when all we have is the Blobs API, we should emulate it.

### What changes are included in this PR?

 - Skip fewer tests
 - Re-implement `GetFileInfo` using `ListBlobsByHierarchy` instead of `ListBlobs`
 - Re-implement `CreateDir` with an upfront HNS support check instead of falling back to Blobs API after an error
 - Add comprehensive tests to `CreateDir`
 - Add `HasSubmitBatchBug` to check if a test inside any scenario is affected by a certain Azurite issue
 - Implement `DeleteDir` to work properly on flat namespace storage accounts (non-HNS accounts)
 - 

### Are these changes tested?

Yes. By existing and new tests added by this PR itself.
* Closes: apache#38772

Authored-by: Felipe Oliveira Carvalho <[email protected]>
Signed-off-by: Felipe Oliveira Carvalho <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants