Support jvm version libhdfs in velox #9835
Conversation
@mbasmanova @majetideepak Can you help to review? Thanks.
Force-pushed from fbb5763 to a33070c.
@JkSelf CI is red. Please take a look.

cc: @majetideepak

It seems Arrow's HDFS filesystem depends on an hdfs dynamic library being present on the system and dynamically links against it. I see this PR does something similar as well.

We dynamically link against libhdfs3 (hawq) as well. It should not hurt to support it?

Well, it is unmaintained, so it shouldn't be the default, and we should probably mark it deprecated and remove it in the future.

Switching the default makes sense since hawq is not maintained. But I wonder if the hawq implementation provides additional benefits (say, performance) compared to the Java implementation.

@majetideepak Yes, that makes sense to me. If we proceed this way, we can avoid the need to ensure that the user's system has HDFS and a JVM installed during the build phase.

While better performance is of course great, in this case it comes with the considerable drawback that the library in question has been unmaintained for several years and the parent project is currently in the process of being archived as well. My vote would be to remove it entirely, but I am not a user, so I'll let others decide :)

@assignUser Agree to remove the hawq implementation entirely. @majetideepak What do you think?

I am not a user either :). Let's open a discussion on this and get input from the wider community.

@assignUser @majetideepak Thank you for your response. I have initiated a poll in the community at #9969. Your participation and vote would be greatly appreciated.
CMakeLists.txt
```cmake
find_library(
  LIBHDFS3
  NAMES libhdfs3.so libhdfs3.dylib
  HINTS "${CMAKE_SOURCE_DIR}/hawq/depends/libhdfs3/_build/src/" REQUIRED)
add_definitions(-DVELOX_ENABLE_HDFS3)
```
use add_definitions(-DVELOX_ENABLE_HDFS)?
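Applied to the snippet under review, the suggested rename would look roughly like this (a sketch; only the macro name changes):

```cmake
find_library(
  LIBHDFS3
  NAMES libhdfs3.so libhdfs3.dylib
  HINTS "${CMAKE_SOURCE_DIR}/hawq/depends/libhdfs3/_build/src/" REQUIRED)
# Generic macro name, as suggested in the review:
add_definitions(-DVELOX_ENABLE_HDFS)
```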
@JkSelf can you resolve the conflict?

@JkSelf Thank you for your work. Is there any progress? It would be really nice to replace libhdfs3 with libhdfs so that we can support more file systems based on Hadoop FileSystem.
LGTM now.
@JkSelf one minor comment. I will add the merge label once that is fixed.
velox/connectors/hive/storage_adapters/hdfs/tests/HdfsFileSystemTest.cpp
@majetideepak Please help to add the ready-to-merge label. Thanks.

@pedroerp has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Conbench analyzed the 1 benchmark run on this commit. There were no benchmark performance regressions. 🎉 The full Conbench report has more details.
Summary: Currently, Gluten will throw HDFS connection failures when executing queries if the HDFS system employs Kerberos authentication and ViewFS support. This is because the existing libhdfs3 API does not support Kerberos authentication, whereas the JVM version of libhdfs can invoke APIs that support it. If the user's system has the `HADOOP_HOME` environment variable set, the JVM version of libhdfs will be used during the compilation of Gluten; if not set, the default libhdfs3 will be used instead.

Pull Request resolved: facebookincubator#9835 Reviewed By: xiaoxmeng, DanielHunte Differential Revision: D64872596 Pulled By: pedroerp fbshipit-source-id: 995ee73a9f8474f8a6467b926a56246073fee75e
```cmake
velox_link_libraries(
  velox_external_hdfs
  PRIVATE
  arrow)
```
Why use PRIVATE? If arrow is installed in a non-system path and picked up via CMAKE_PREFIX_PATH, the velox_hdfs target cannot find the arrow header files.

```shell
# PRIVATE
/usr/lib64/ccache/c++ -I/root/oap-velox/. -I/root/oap-velox/velox/external/xxhash -isystem /root/oap-velox/velox -isystem /root/oap-velox/velox/external -isystem /usr/include/libdwarf -isystem /root/oap-velox/velox/external/date/velox/external -o velox/connectors/hive/storage_adapters/hdfs/CMakeFiles/velox_hdfs.dir/HdfsReadFile.cpp.o -c /root/oap-velox/velox/connectors/hive/storage_adapters/hdfs/HdfsReadFile.cpp

# NON-PRIVATE
/usr/lib64/ccache/c++ -I/root/oap-velox/. -I/root/oap-velox/velox/external/xxhash -isystem /root/oap-velox/velox -isystem /root/oap-velox/velox/external -isystem /opt/extra_lib/arrow_ep/include -isystem /root/apache-arrow/cpp/_build/thrift_ep-install/include -isystem /usr/include/libdwarf -isystem /root/oap-velox/velox/external/date/velox/external -o velox/connectors/hive/storage_adapters/hdfs/CMakeFiles/velox_hdfs.dir/HdfsReadFile.cpp.o -c /root/oap-velox/velox/connectors/hive/storage_adapters/hdfs/HdfsReadFile.cpp
```
@Yohahaha, thanks for raising this issue. The PRIVATE linkage makes the headers attached to target arrow inaccessible to consumers of velox_external_hdfs (such as velox_hdfs). If those headers are in a system path, this is not an issue since the compiler can always find them there; otherwise, a build error will occur. @JkSelf, if there is no special reason to use PRIVATE, could you change it to PUBLIC?
Sorry I did not catch this. I saw the arrow headers in the implementation file and didn't check the header as I usually only include things that are used privately in the implementation and everything else in the header.
…hdfs (#11371)

Summary: Fix #9835 (comment)

Pull Request resolved: #11371 Reviewed By: Yuhta Differential Revision: D65420201 Pulled By: mbasmanova fbshipit-source-id: 4bf4fc418f61c246b4113960e499e054c6b7ece8
This reverts commit 10cdf6f.
```
@@ -15,7 +15,12 @@
 */
#include "velox/common/file/FileSystems.h"

namespace velox::filesystems::arrow::io::internal {
```
Is this change necessary? I can see that `HdfsFileSystem.cpp` includes `ArrowHdfsInternal.h`, which already has `LibHdfsShim`. Btw, is the namespace correct here? Should it be `facebook::velox::filesystems::arrow::io::internal`?
…in HdfsReadFile (#11580) Summary: To address this [comment](#9835 (comment)), we moved the thread local parameter `folly::ThreadLocal<HdfsFile> file_` from a member parameter to a template parameter in the `HdfsReadFile::preadInternal` method. The frequent initialization of the thread local parameter in the method caused about a 5% performance degradation. We reverted the thread local parameter to a member parameter in this PR. Pull Request resolved: #11580 Reviewed By: kgpai Differential Revision: D66376694 Pulled By: kagamiori fbshipit-source-id: ba639dd9bd52a81a71d2580d3af5259324a8c92a
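The member-vs-per-call distinction can be sketched with standard C++ `thread_local` in place of `folly::ThreadLocal` (names here are hypothetical, not Velox's actual code): the per-thread file handle is constructed once per thread and reused across pread calls, which is the behavior the revert restores.

```cpp
#include <cstdint>

// Stand-in for the per-thread HDFS file handle state.
struct HdfsFileHandle {
  int useCount = 0;  // counts how often this thread's handle is reused
};

inline HdfsFileHandle& threadLocalFile() {
  // Constructed on first use per thread, then reused by every pread on
  // that thread -- avoids the repeated-initialization overhead that a
  // per-call thread-local caused.
  static thread_local HdfsFileHandle file;
  return file;
}

inline int64_t preadInternal(int64_t offset, int64_t length) {
  HdfsFileHandle& file = threadLocalFile();
  ++file.useCount;         // same handle across calls on this thread
  return offset + length;  // placeholder for the actual read result
}
```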
Summary: After merging PR #9835, Gluten encountered the following compilation error. This PR refactors the package namespace to resolve the ambiguity issue.

```
/mnt/DP_disk3/jk/projects/gluten/cpp/velox/memory/VeloxColumnarBatch.cc:33:5: error: reference to ‘velox’ is ambiguous
   33 |     velox::memory::MemoryPool* pool,
      |     ^~~~~
In file included from /mnt/DP_disk3/jk/projects/gluten/cpp/velox/operators/writer/VeloxParquetDataSource.h:43,
                 from /mnt/DP_disk3/jk/projects/gluten/cpp/velox/compute/VeloxRuntime.h:25,
                 from /mnt/DP_disk3/jk/projects/gluten/cpp/velox/memory/VeloxColumnarBatch.cc:18:
/mnt/DP_disk3/jk/projects/gluten/dev/../ep/build-velox/build/velox_ep/velox/connectors/hive/storage_adapters/hdfs/HdfsFileSystem.h:18:11: note: candidates are: ‘namespace velox { }’
   18 | namespace velox::filesystems::arrow::io::internal {
      |           ^~~~~
In file included from /mnt/DP_disk3/jk/projects/gluten/dev/../ep/build-velox/build/velox_ep/velox/common/base/VeloxException.h:29,
                 from /mnt/DP_disk3/jk/projects/gluten/dev/../ep/build-velox/build/velox_ep/velox/common/base/Exceptions.h:27,
                 from /mnt/DP_disk3/jk/projects/gluten/dev/../ep/build-velox/build/velox_ep/velox/common/base/BitUtil.h:19,
                 from /mnt/DP_disk3/jk/projects/gluten/dev/../ep/build-velox/build/velox_ep/velox/common/caching/AsyncDataCache.h:27,
                 from /mnt/DP_disk3/jk/projects/gluten/cpp/velox/compute/VeloxBackend.h:27,
                 from /mnt/DP_disk3/jk/projects/gluten/cpp/velox/memory/VeloxMemoryManager.h:20,
                 from /mnt/DP_disk3/jk/projects/gluten/cpp/velox/memory/VeloxColumnarBatch.h:21,
                 from /mnt/DP_disk3/jk/projects/gluten/cpp/velox/memory/VeloxColumnarBatch.cc:17:
```

Pull Request resolved: #11585 Reviewed By: xiaoxmeng Differential Revision: D67119176 Pulled By: kgpai fbshipit-source-id: d8eab6f8570ffa7b35bdc1b890c658f6926b8313
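The ambiguity can be reproduced in a small, self-contained C++ sketch (the names below are illustrative, not Velox's actual declarations): Velox code normally lives under `facebook::velox`, but the HDFS shim header opened a top-level `::velox` namespace, so any file that does `using namespace facebook;` sees two candidates for the name `velox`.

```cpp
#include <string>

// Mirror of facebook::velox, where Velox's real code lives.
namespace facebook::velox::memory {
inline std::string poolKind() { return "facebook::velox::memory"; }
}  // namespace facebook::velox::memory

// The pre-fix declaration: a top-level ::velox namespace from the shim.
namespace velox::filesystems::arrow::io::internal {
struct LibHdfsShim {};
}  // namespace velox::filesystems::arrow::io::internal

inline std::string lookupPoolKind() {
  using namespace facebook;
  // Writing plain `velox::memory::poolKind()` here would be ambiguous
  // between ::velox and facebook::velox; full qualification (or moving
  // the shim under facebook::velox, as the PR does) resolves it.
  return ::facebook::velox::memory::poolKind();
}
```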
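A rough sketch of the `HADOOP_HOME`-based selection described in the PR summary (the variable names and hint paths are illustrative, not the PR's exact CMake):

```cmake
if(DEFINED ENV{HADOOP_HOME})
  # JVM libhdfs: supports Kerberos and ViewFS via the Hadoop Java client.
  find_library(LIBHDFS hdfs HINTS "$ENV{HADOOP_HOME}/lib/native" REQUIRED)
  add_definitions(-DVELOX_ENABLE_HDFS)
else()
  # Fall back to the native libhdfs3 client.
  find_library(LIBHDFS3 NAMES libhdfs3.so libhdfs3.dylib REQUIRED)
  add_definitions(-DVELOX_ENABLE_HDFS3)
endif()
```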