Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support jvm version libhdfs in velox #9835

Closed
wants to merge 2 commits into from

Conversation

JkSelf
Copy link
Collaborator

@JkSelf JkSelf commented May 16, 2024

Currently, Gluten will throw hdfs connection failures when executing queries if their HDFS system employs Kerberos authentication and Viewfs support. This is due to the fact that the existing libhdfs3 API does not support Kerberos authentication, whereas the JVM version of libhdfs is capable of invoking APIs that support Kerberos authentication. If the user's system has the HADOOP_HOME environment variable set, the JVM version of libhdfs will be used during the compilation of Gluten; if not set, the default libhdfs3 will be used instead.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 16, 2024
Copy link

netlify bot commented May 16, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 9c058b0
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/67184d321be84b0008529d2a

@JkSelf
Copy link
Collaborator Author

JkSelf commented May 16, 2024

@mbasmanova @majetideepak Can you help to review? Thanks.

@JkSelf JkSelf force-pushed the libhdfs branch 3 times, most recently from fbb5763 to a33070c Compare May 16, 2024 03:39
@mbasmanova
Copy link
Contributor

@JkSelf CI is red. Please, take a look.

@assignUser
Copy link
Collaborator

The C++ libhdfs3 is unmaintained (no commits in 3+ years, hawq is moving into the attic (asf 'archive')) so I think this should replace it entirely! @pedroerp @kgpai we talked about this before, I think this is what arrow does as well but I haven't had a look yet.

@kgpai
Copy link
Contributor

kgpai commented May 22, 2024

cc: @majetideepak

@majetideepak
Copy link
Collaborator

majetideepak commented May 23, 2024

It seems Arrow's HDFS filesystem depends on some hdfs dynamic library to be present in the system and dynamically links against it. I see this PR does something similar as well.
https://github.com/apache/arrow/blob/main/cpp/src/arrow/io/hdfs_internal.cc#L114

get_potential_libhdfs_paths()
get_potential_libjvm_paths()

@majetideepak
Copy link
Collaborator

We do dynamically link against LIBHDFS3(hawk) as well. It should not hurt to support it?
We should avoid the multiple ifdefs in this PR.

@assignUser
Copy link
Collaborator

It should not hurt to support it?

Well it is unmaintained so it should'nt be the default and we should probably mark it deprecated and remove it in the future.

@majetideepak
Copy link
Collaborator

Well it is unmaintained so it should'nt be the default and we should probably mark it deprecated and remove it in the future.

Switching the default makes sense since hawq is not maintained. But I wonder if the hawq implementation provides additional benefits (say performance) compared to the Java implementation.

@JkSelf
Copy link
Collaborator Author

JkSelf commented May 27, 2024

It seems Arrow's HDFS filesystem depends on some hdfs dynamic library to be present in the system and dynamically links against it. I see this PR does something similar as well. https://github.com/apache/arrow/blob/main/cpp/src/arrow/io/hdfs_internal.cc#L114

get_potential_libhdfs_paths()
get_potential_libjvm_paths()

@majetideepak Yes, that makes sense to me. If we proceed this way, we can avoid the need to ensure that the user's system has HDFS and JVM installed during the build phase.

@JkSelf
Copy link
Collaborator Author

JkSelf commented May 27, 2024

Well it is unmaintained so it should'nt be the default and we should probably mark it deprecated and remove it in the future.

Switching the default makes sense since hawq is not maintained. But I wonder if the hawq implementation provides additional benefits (say performance) compared to the Java implementation.

  • In terms of performance, hawq implementation is generally considered to be superior to jvm libhdfs because it is designed for enhanced performance and reduces the additional overhead caused by JNI. However, functionality is equally important to consider. Currently, hawq implementation does not meet user needs in certain functional aspects, such as lacking support for Kerberos authentication and ViewFs.

  • If there is a preference to retain hawq implementation for its performance, we could offer two compilation options, VELOX_ENABLE_HDFS and VELOX_ENABLE_HDFS3, to support jvm libhdfs and hawq implementation respectively. However, maintaining both options might necessitate the use of multiple ifdef statements to determine which hdfs class to load.
    @majetideepak @assignUser What do you think? Thanks.

@assignUser
Copy link
Collaborator

In terms of performance, hawq implementation is generally considered to be superior t

While better performance is of course great, in this case it comes with the considerable drawback that the library in question has been unmaintained for several years and the parent project is currently in the process of being 'archived' as well. My vote would be to remove it entirely but I am not a user so I'll let others decide :)

@JkSelf
Copy link
Collaborator Author

JkSelf commented May 29, 2024

@assignUser Agree to remove hawa implementation entirely. @majetideepak What do you think?

@majetideepak
Copy link
Collaborator

Agree to remove hawa implementation entirely.

I am not a user as well :). Let's open a discussion on this and get input from the wider community.

@JkSelf
Copy link
Collaborator Author

JkSelf commented May 29, 2024

@assignUser @majetideepak Thank you for your response. I have initiated a poll in the community at #9969. Your participation and vote would be greatly appreciated.

CMakeLists.txt Outdated
find_library(
LIBHDFS3
NAMES libhdfs3.so libhdfs3.dylib
HINTS "${CMAKE_SOURCE_DIR}/hawq/depends/libhdfs3/_build/src/" REQUIRED)
add_definitions(-DVELOX_ENABLE_HDFS3)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use add_definitions(-DVELOX_ENABLE_HDFS)?

@FelixYBW
Copy link

@JkSelf can you resolve the conflict?

@wForget
Copy link

wForget commented Sep 2, 2024

@JkSelf Thank you for your work. Is there any progress? It would be a really nice feature for me to replace with libhdfs so that we can support more file systems based on Hadoop FileSystem.

Copy link
Collaborator

@assignUser assignUser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM now.

Copy link
Collaborator

@majetideepak majetideepak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JkSelf one minor comment. I will add the merge label once that is fixed.

@JkSelf
Copy link
Collaborator Author

JkSelf commented Oct 23, 2024

@majetideepak Please help to add the ready to merge label. Thanks.

@majetideepak majetideepak added the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label Oct 23, 2024
@facebook-github-bot
Copy link
Contributor

@pedroerp has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@pedroerp merged this pull request in 10cdf6f.

Copy link

Conbench analyzed the 1 benchmark run on commit 10cdf6fd.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

JkSelf added a commit to JkSelf/velox that referenced this pull request Oct 24, 2024
Summary:
Currently, Gluten will throw hdfs connection failures when executing queries if their HDFS system employs Kerberos authentication and Viewfs support. This is due to the fact that the existing libhdfs3 API does not support Kerberos authentication, whereas the JVM version of libhdfs is capable of invoking APIs that support Kerberos authentication. If the user's system has the `HADOOP_HOME `environment variable set, the JVM version of libhdfs will be used during the compilation of Gluten; if not set, the default libhdfs3 will be used instead.

Pull Request resolved: facebookincubator#9835

Reviewed By: xiaoxmeng, DanielHunte

Differential Revision: D64872596

Pulled By: pedroerp

fbshipit-source-id: 995ee73a9f8474f8a6467b926a56246073fee75e
JkSelf added a commit to JkSelf/velox that referenced this pull request Oct 25, 2024
Summary:
Currently, Gluten will throw hdfs connection failures when executing queries if their HDFS system employs Kerberos authentication and Viewfs support. This is due to the fact that the existing libhdfs3 API does not support Kerberos authentication, whereas the JVM version of libhdfs is capable of invoking APIs that support Kerberos authentication. If the user's system has the `HADOOP_HOME `environment variable set, the JVM version of libhdfs will be used during the compilation of Gluten; if not set, the default libhdfs3 will be used instead.

Pull Request resolved: facebookincubator#9835

Reviewed By: xiaoxmeng, DanielHunte

Differential Revision: D64872596

Pulled By: pedroerp

fbshipit-source-id: 995ee73a9f8474f8a6467b926a56246073fee75e
Comment on lines +15 to +18
velox_link_libraries(
velox_external_hdfs
PRIVATE
arrow)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why use PRIVATE?
If arrow has already installed in non-system path, and be included with CMAKE_PREFIX_PATH, velox_hdfs target can not found arrow header files.

# PRIVATE
/usr/lib64/ccache/c++ -I/root/oap-velox/. -I/root/oap-velox/velox/external/xxhash -isystem /root/oap-velox/velox -isystem /root/oap-velox/velox/external -isystem /usr/include/libdwarf -isystem /root/oap-velox/velox/external/date/velox/external  -o velox/connectors/hive/storage_adapters/hdfs/CMakeFiles/velox_hdfs.dir/HdfsReadFile.cpp.o -c /root/oap-velox/velox/connectors/hive/storage_adapters/hdfs/HdfsReadFile.cpp

# NON-PRIVATE
/usr/lib64/ccache/c++ -I/root/oap-velox/. -I/root/oap-velox/velox/external/xxhash -isystem /root/oap-velox/velox -isystem /root/oap-velox/velox/external -isystem /opt/extra_lib/arrow_ep/include -isystem /root/apache-arrow/cpp/_build/thrift_ep-install/include -isystem /usr/include/libdwarf -isystem /root/oap-velox/velox/external/date/velox/external -o velox/connectors/hive/storage_adapters/hdfs/CMakeFiles/velox_hdfs.dir/HdfsReadFile.cpp.o -c /root/oap-velox/velox/connectors/hive/storage_adapters/hdfs/HdfsReadFile.cpp

@JkSelf @PHILO-HE

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Yohahaha, thanks for raising up this issue. The PRIVATE linkage will make the headers bound to target arrow not accessible to target velox_external_hdfs. If those headers are in system path, it is not an issue since compiler can always find them from system path. Otherwise, build error will occur. @JkSelf, if there is no special reason to use PRIVATE, could you change it to PUBLIC?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Yohahaha @PHILO-HE Yes. I will file a pr to change to PUBLIC. Thanks.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I did not catch this. I saw the arrow headers in the implementation file and didn't check the header as I usually only include things that are used privately in the implementation and everything else in the header.

facebook-github-bot pushed a commit that referenced this pull request Nov 6, 2024
…hdfs (#11371)

Summary:
Fix #9835 (comment)

Pull Request resolved: #11371

Reviewed By: Yuhta

Differential Revision: D65420201

Pulled By: mbasmanova

fbshipit-source-id: 4bf4fc418f61c246b4113960e499e054c6b7ece8
JkSelf added a commit to JkSelf/velox that referenced this pull request Nov 14, 2024
JkSelf added a commit to JkSelf/velox that referenced this pull request Nov 15, 2024
JkSelf added a commit to JkSelf/velox that referenced this pull request Nov 18, 2024
JkSelf added a commit to JkSelf/velox that referenced this pull request Nov 18, 2024
@@ -15,7 +15,12 @@
*/
#include "velox/common/file/FileSystems.h"

namespace velox::filesystems::arrow::io::internal {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change necessary? I can see in HdfsFileSystem.cpp, it includes ArrowHdfsInternal.h, which has LibHdfsShim already.

Btw, is the namespace correct here? Should be facebook::velox::filesystems::arrow::io::internal??

facebook-github-bot pushed a commit that referenced this pull request Nov 25, 2024
…in HdfsReadFile (#11580)

Summary:
To address this [comment](#9835 (comment)), we moved the thread local parameter `folly::ThreadLocal<HdfsFile> file_` from a member parameter to a template parameter in the `HdfsReadFile::preadInternal` method. The frequent initialization of the thread local parameter in the method caused about a 5% performance degradation. We reverted the thread local parameter to a member parameter in this PR.

Pull Request resolved: #11580

Reviewed By: kgpai

Differential Revision: D66376694

Pulled By: kagamiori

fbshipit-source-id: ba639dd9bd52a81a71d2580d3af5259324a8c92a
TongWei1105 pushed a commit to TongWei1105/velox that referenced this pull request Dec 3, 2024
…in HdfsReadFile (facebookincubator#11580)

Summary:
To address this [comment](facebookincubator#9835 (comment)), we moved the thread local parameter `folly::ThreadLocal<HdfsFile> file_` from a member parameter to a template parameter in the `HdfsReadFile::preadInternal` method. The frequent initialization of the thread local parameter in the method caused about a 5% performance degradation. We reverted the thread local parameter to a member parameter in this PR.

Pull Request resolved: facebookincubator#11580

Reviewed By: kgpai

Differential Revision: D66376694

Pulled By: kagamiori

fbshipit-source-id: ba639dd9bd52a81a71d2580d3af5259324a8c92a
facebook-github-bot pushed a commit that referenced this pull request Dec 12, 2024
Summary:
After merging PR #9835, Gluten encountered the following compilation error. This PR refactors the package namespace to resolve the ambiguity issue.

```
/mnt/DP_disk3/jk/projects/gluten/cpp/velox/memory/VeloxColumnarBatch.cc:33:5: error: reference to ‘velox’ is ambiguous
   33 |     velox::memory::MemoryPool* pool,
      |     ^~~~~
In file included from /mnt/DP_disk3/jk/projects/gluten/cpp/velox/operators/writer/VeloxParquetDataSource.h:43,
                 from /mnt/DP_disk3/jk/projects/gluten/cpp/velox/compute/VeloxRuntime.h:25,
                 from /mnt/DP_disk3/jk/projects/gluten/cpp/velox/memory/VeloxColumnarBatch.cc:18:
/mnt/DP_disk3/jk/projects/gluten/dev/../ep/build-velox/build/velox_ep/velox/connectors/hive/storage_adapters/hdfs/HdfsFileSystem.h:18:11: note: candidates are: ‘namespace velox { }’
   18 | namespace velox::filesystems::arrow::io::internal {
      |           ^~~~~
In file included from /mnt/DP_disk3/jk/projects/gluten/dev/../ep/build-velox/build/velox_ep/velox/common/base/VeloxException.h:29,
                 from /mnt/DP_disk3/jk/projects/gluten/dev/../ep/build-velox/build/velox_ep/velox/common/base/Exceptions.h:27,
                 from /mnt/DP_disk3/jk/projects/gluten/dev/../ep/build-velox/build/velox_ep/velox/common/base/BitUtil.h:19,
                 from /mnt/DP_disk3/jk/projects/gluten/dev/../ep/build-velox/build/velox_ep/velox/common/caching/AsyncDataCache.h:27,
                 from /mnt/DP_disk3/jk/projects/gluten/cpp/velox/compute/VeloxBackend.h:27,
                 from /mnt/DP_disk3/jk/projects/gluten/cpp/velox/memory/VeloxMemoryManager.h:20,
                 from /mnt/DP_disk3/jk/projects/gluten/cpp/velox/memory/VeloxColumnarBatch.h:21,
                 from /mnt/DP_disk3/jk/projects/gluten/cpp/velox/memory/VeloxColumnarBatch.cc:17:
```

Pull Request resolved: #11585

Reviewed By: xiaoxmeng

Differential Revision: D67119176

Pulled By: kgpai

fbshipit-source-id: d8eab6f8570ffa7b35bdc1b890c658f6926b8313
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall
Projects
None yet
Development

Successfully merging this pull request may close these issues.