Add Clang Linux build to CI pipeline #10767

Closed
wants to merge 1 commit into from

czentgr
Collaborator

@czentgr czentgr commented Aug 15, 2024

The linux build is refactored to be able to switch between clang and gcc based builds of Velox.
The gcc based linux build is run on pull and push.
The clang based linux build is added to the scheduled jobs and executed on schedule
only.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 15, 2024

netlify bot commented Aug 15, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 5b209d8
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/673533e5e697e4000811b2c3

@czentgr czentgr force-pushed the cz_add_clang_build branch from 984a9e0 to 245f3a3 Compare August 15, 2024 22:56
@czentgr
Collaborator Author

czentgr commented Aug 15, 2024

The workflow isn't triggered. Maybe a chicken and egg problem?

@majetideepak FYI.
@assignUser Can you please take a look? Do you have any idea on how to test it?

@assignUser
Collaborator

@czentgr the job was triggered but failed during parsing, which means it doesn't show up in the check list (not a great choice on gh's part...) https://github.com/facebookincubator/velox/actions/runs/10411698998/workflow#L53
You'll have to use either the full owner/repo@version syntax or the local file syntax with ./... https://docs.github.com/en/actions/sharing-automations/reusing-workflows#calling-a-reusable-workflow
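
For illustration, the two accepted forms of a job-level `uses:` value can be told apart with a small shell check. This is only a sketch: `check_workflow_ref` is a hypothetical helper, not something in this PR, and the workflow file names in the usage note are made up. It assumes the forms described in the linked docs, `./.github/workflows/{filename}` for a local file and `{owner}/{repo}/.github/workflows/{filename}@{ref}` for a remote one.

```shell
# Hypothetical helper: classify a job-level `uses:` value for a reusable workflow.
# "local"   -> ./path/to/workflow.yml (same repository, no @ref)
# "remote"  -> owner/repo/path/to/workflow.yml@ref
# "invalid" -> anything else, e.g. a repo-relative path without the leading ./
check_workflow_ref() {
  case "$1" in
    ./*.yml|./*.yaml) echo "local" ;;
    */*/*.yml@?*|*/*/*.yaml@?*) echo "remote" ;;
    *) echo "invalid"; return 1 ;;
  esac
}
```

With this, `check_workflow_ref ./.github/workflows/example.yml` reports a local reference, while `.github/workflows/example.yml` (no leading `./`) is rejected, which is the kind of path that makes the workflow fail at parse time.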

@czentgr czentgr force-pushed the cz_add_clang_build branch from 245f3a3 to 195045f Compare August 19, 2024 18:51
@czentgr
Collaborator Author

czentgr commented Aug 19, 2024

@assignUser Thank you! I fixed the path. I overlooked the part with the path in the doc.

@czentgr czentgr force-pushed the cz_add_clang_build branch 11 times, most recently from a45f534 to f12993e Compare August 21, 2024 18:20
@czentgr
Collaborator Author

czentgr commented Aug 21, 2024

@assignUser I hope it is nothing bad, but I got two odd issues in my recent run:

linux-adapter with clang (the one I am most interested in) failed with a network issue (artifact read timed out).
https://github.com/facebookincubator/velox/actions/runs/10495295073/job/29073427054?pr=10767

Run gh run download $STASH_RUN_ID --name $STASH_NAME --dir "$STASH_DIR" -R "$REPO"
error downloading ccache-linux-adapters-10767_merge: error writing zip archive: read tcp 172.18.0.2:52660->20.209.227.33:443: read: connection timed out

And MacOS14 failed during upload
https://github.com/facebookincubator/velox/actions/runs/10495295084/job/29073426792?pr=10767

with

 node:events:497
      throw er; // Unhandled 'error' event
      ^

Error: EMFILE: too many open files, open '/Users/runner/work/velox/velox/.ccache/e/a/belcl8n03chag2f4bkp5u4di8puknssR'
Emitted 'error' event on ReadStream instance at:
    at emitErrorNT (node:internal/streams/destroy:169:8)
    at emitErrorCloseNT (node:internal/streams/destroy:128:3)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
  errno: -24,
  code: 'EMFILE',
  syscall: 'open',
  path: '/Users/runner/work/velox/velox/.ccache/e/a/belcl8n03chag2f4bkp5u4di8puknssR'

I will retry later.
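
As an aside, a transient timeout like the one in the download step could be absorbed by a retry wrapper instead of a manual re-run. The following is a minimal sketch, assuming a POSIX shell; `retry` is a hypothetical helper and not part of the Velox workflows.

```shell
# Hypothetical helper: run a command up to N times, doubling the delay
# between attempts. Usage: retry <attempts> <initial-delay-seconds> <command...>
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while :; do
    # Stop as soon as the command succeeds.
    "$@" && return 0
    # Give up once the attempt budget is exhausted.
    [ "$i" -ge "$attempts" ] && return 1
    sleep "$delay"
    delay=$((delay * 2))
    i=$((i + 1))
  done
}
```

The failing step would then read `retry 3 5 gh run download "$STASH_RUN_ID" --name "$STASH_NAME" --dir "$STASH_DIR" -R "$REPO"`. This only helps with intermittent network errors; it would not address the EMFILE failure on MacOS14.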

@czentgr czentgr force-pushed the cz_add_clang_build branch from f12993e to a5a4c98 Compare August 21, 2024 23:36
@assignUser
Collaborator

I have deleted all active caches for the mac 14 build (on main), that should fix this issue.

@czentgr
Collaborator Author

czentgr commented Aug 22, 2024

The build issue for ubuntu-clang Linux with adapters is fixed in PR #10800.

@czentgr czentgr marked this pull request as ready for review August 22, 2024 15:20
@czentgr czentgr changed the title [WIP] Add Clang Linux build to CI pipeline Add Clang Linux build to CI pipeline Aug 22, 2024
@assignUser
Collaborator

It seems the macos14 issue is related to a bug in the artifacts action, looking into it.

@czentgr czentgr force-pushed the cz_add_clang_build branch 2 times, most recently from eee46a1 to 3ead727 Compare August 28, 2024 20:09
@czentgr
Collaborator Author

czentgr commented Aug 29, 2024

The release build with clang15 causes one of the tests in velox_dwio_common_test to SEGV. The issue only occurs with a release build.

Running main() from /builddir/build/BUILD/googletest-release-1.11.0/googletest/src/gtest_main.cc
Note: Google Test filter = ParallelForTest.E2E
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from ParallelForTest
[ RUN      ] ParallelForTest.E2E

Program received signal SIGSEGV, Segmentation fault.
0x0000000000cfe2a0 in folly::InlineLikeExecutor::InlineLikeExecutor (this=0x10b8318 <folly::InlineExecutor::instance_slow()::instance>, __vtt_parm=0x8, __in_chrg=<optimized out>) at /root/deps/folly/folly/executors/InlineExecutor.h:27
27      class InlineLikeExecutor : public virtual Executor {
Missing separate debuginfos, use: dnf debuginfo-install cyrus-sasl-lib-2.1.27-21.el9.x86_64 double-conversion-3.1.5-6.el9.x86_64 gflags-2.2.2-9.el9.x86_64 glibc-2.34-120.el9.x86_64 gmock-1.11.0-1.el9.x86_64 gtest-1.11.0-1.el9.x86_64 keyutils-libs-1.6.3-1.el9.x86_64 krb5-libs-1.21.1-3.el9.x86_64 libatomic-11.5.0-2.el9.x86_64 libbrotli-1.0.9-6.el9.x86_64 libcom_err-1.46.5-5.el9.x86_64 libcurl-7.76.1-29.el9.x86_64 libdwarf-0.3.4-1.el9.1.x86_64 libevent-2.1.12-6.el9.x86_64 libgcc-11.5.0-2.el9.x86_64 libgsasl-1.10.0-3.el9.x86_64 libicu-67.1-9.el9.x86_64 libidn-1.38-4.el9.x86_64 libidn2-2.3.0-7.el9.x86_64 libnghttp2-1.43.0-6.el9.x86_64 libpsl-0.21.1-5.el9.x86_64 libselinux-3.6-1.el9.x86_64 libsodium-1.0.18-8.el9.x86_64 libssh-0.10.4-13.el9.x86_64 libunistring-0.9.10-15.el9.x86_64 libxml2-2.9.13-6.el9.x86_64 libzstd-1.5.1-2.el9.x86_64 lz4-libs-1.9.3-5.el9.x86_64 openssl-libs-3.2.2-2.el9.x86_64 pcre2-10.40-6.el9.x86_64 re2-20211101-20.el9.x86_64 xz-libs-5.2.5-8.el9.x86_64 zlib-1.2.11-41.el9.x86_64
(gdb) bt
#0  0x0000000000cfe2a0 in folly::InlineLikeExecutor::InlineLikeExecutor (this=0x10b8318 <folly::InlineExecutor::instance_slow()::instance>, __vtt_parm=0x8, __in_chrg=<optimized out>)
    at /root/deps/folly/folly/executors/InlineExecutor.h:27
#1  0x0000000000cfe302 in folly::InlineExecutor::InlineExecutor (this=0x10b8318 <folly::InlineExecutor::instance_slow()::instance>, __in_chrg=<optimized out>, __vtt_parm=<optimized out>)
    at /root/deps/folly/folly/executors/InlineExecutor.h:35
#2  0x0000000000cfe356 in folly::Indestructible<folly::InlineExecutor>::Storage::Storage<, folly::InlineExecutor>(std::in_place_t) (this=0x10b8318 <folly::InlineExecutor::instance_slow()::instance>)
    at /root/deps/folly/folly/Indestructible.h:152
#3  0x0000000000cfe28c in folly::Indestructible<folly::InlineExecutor>::Indestructible<folly::InlineExecutor, folly::InlineExecutor> (this=0x10b8318 <folly::InlineExecutor::instance_slow()::instance>)
    at /root/deps/folly/folly/Indestructible.h:73
#4  0x0000000000cfe1ed in folly::InlineExecutor::instance_slow () at /root/deps/folly/folly/executors/InlineExecutor.cpp:24
#5  0x0000000000952ca6 in ParallelForTest_E2E_Test::TestBody() ()
#6  0x00007ffff796ee0c in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) [clone .constprop.0] () from /lib64/libgtest.so.1.11.0
#7  0x00007ffff794f826 in testing::Test::Run() () from /lib64/libgtest.so.1.11.0
#8  0x00007ffff794f9f0 in testing::TestInfo::Run() () from /lib64/libgtest.so.1.11.0
#9  0x00007ffff794faf9 in testing::TestSuite::Run() () from /lib64/libgtest.so.1.11.0
#10 0x00007ffff795efc5 in testing::internal::UnitTestImpl::RunAllTests() () from /lib64/libgtest.so.1.11.0
#11 0x00007ffff795c7c8 in testing::UnitTest::Run() () from /lib64/libgtest.so.1.11.0
#12 0x00007ffff7fac154 in main () from /lib64/libgtest_main.so.1.11.0
#13 0x00007ffff6229590 in __libc_start_call_main () from /lib64/libc.so.6
#14 0x00007ffff6229640 in __libc_start_main_impl () from /lib64/libc.so.6
#15 0x00000000008f8395 in _start ()
(gdb) p $_siginfo
$1 = {si_signo = 11, si_errno = 0, si_code = 1, _sifields = {_pad = {8, 0 <repeats 27 times>}, _kill = {si_pid = 8, si_uid = 0}, _timer = {si_tid = 8, si_overrun = 0, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _rt = {si_pid = 8,
      si_uid = 0, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _sigchld = {si_pid = 8, si_uid = 0, si_status = 0, si_utime = 0, si_stime = 0}, _sigfault = {si_addr = 0x8, _addr_lsb = 0, _addr_bnd = {_lower = 0x0, _upper = 0x0}},
    _sigpoll = {si_band = 8, si_fd = 0}}}

The SEGV is at fault address 0x8 (si_addr = 0x8), so a null pointer is accessed at offset 0x8. This matches __vtt_parm=0x8 in frame #0.
Perhaps some optimization in clang, such as instruction reordering, triggers the issue in the test.

The issue occurs when calling the constructor of a static object

static Indestructible<InlineExecutor> instance;

Further investigation shows that the VTT InlineExecutor symbol for a release build is not linked.
Command: nm velox_dwio_common_test -C | grep InlineExecutor

It has no address:

                 v VTT for folly::InlineExecutor

while for a debug build it shows:

000000000185d3e0 V VTT for folly::InlineExecutor

Because the symbol has no address, any access through it is a NULL pointer access, which would result in the SEGV.
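
The nm check above can be turned into a small filter that lists only the weak symbols reported without an address. This is a sketch, not part of the PR: `unresolved_weak_symbols` is a hypothetical helper, and it assumes GNU nm's layout, where a lowercase `v` or `w` type with a blank address column typically marks a weak symbol that was left undefined.

```shell
# Hypothetical helper: read `nm -C` output on stdin and print the names of
# weak symbols (type v or w) that have no address.
# Lines with an address look like "<hex> <type> <name>"; lines without one
# start with whitespace followed by the type letter.
unresolved_weak_symbols() {
  awk '/^[ \t]+[vw] / { sub(/^[ \t]+[vw] /, ""); print }'
}
```

For example, `nm -C velox_dwio_common_test | unresolved_weak_symbols | grep InlineExecutor` should print `VTT for folly::InlineExecutor` for the release build and nothing for the debug build, based on the two nm lines shown above.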

I don't know why Clang15 compiles it like this and why the symbol is not linked properly. It seems the symbol is optimized away. It could also have something to do with the fact that it is a "cold" symbol.
It could also be that folly needs a code change to avoid the optimization. This is a base class constructor call that cannot be made. Perhaps the InlineLikeExecutor needs an explicit constructor definition.

Because this is a test executable, I've added a workaround that compiles this file without optimizations, which resolves the issue. We can revisit this later, perhaps by opening a new issue to investigate it further.

@czentgr czentgr force-pushed the cz_add_clang_build branch 2 times, most recently from 3566ec6 to d0a535e Compare August 29, 2024 21:58
@czentgr czentgr force-pushed the cz_add_clang_build branch from b6272d6 to 48287b1 Compare October 29, 2024 21:56
@assignUser assignUser added the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label Oct 29, 2024
@assignUser assignUser requested a review from kgpai November 1, 2024 17:15
@czentgr czentgr force-pushed the cz_add_clang_build branch from 48287b1 to a0dae26 Compare November 2, 2024 03:46
@facebook-github-bot
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@mbasmanova
Contributor

@czentgr Would you rebase to resolve merge conflicts?

run: |
MINIO_BINARY="minio-2022-05-26"
if [ ! -f /usr/local/bin/${MINIO_BINARY} ]; then
wget https://dl.min.io/server/minio/release/linux-amd64/archive/minio.RELEASE.2022-05-26T05-48-41Z -O ${MINIO_BINARY}
Contributor

nit: in a future update we should bake this into the image; otherwise we run the risk of CI jobs failing when they can't download this for whatever reason.

Collaborator Author

Right, this was added a while back. The entire file is a copy of the original with minor changes.

Collaborator Author

@czentgr czentgr Nov 6, 2024

The Minio binary should already be in the image, as it is part of the setup-adapter.sh install for AWS. In most (all?) cases this download doesn't occur.
We can clean it up in a subsequent PR.

@czentgr czentgr force-pushed the cz_add_clang_build branch from a0dae26 to 3f7c934 Compare November 4, 2024 22:25
@facebook-github-bot
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@mbasmanova
Contributor

@czentgr CI is red.

@czentgr
Collaborator Author

czentgr commented Nov 6, 2024

@mbasmanova Yes. Two flaky tests (MockSharedArbitrationTest, FilterProjectReplayerTest) failed that seem to have been fixed recently (i.e. within the last 24h). I can rebase again. Does that cause an issue given that you already imported it?

@mbasmanova
Contributor

@czentgr I'm seeing a merge conflict with "Add gcc11 to Ubuntu20.04 setup and add PkgConfig install". Would you rebase?

@czentgr czentgr force-pushed the cz_add_clang_build branch from 3f7c934 to c4fda34 Compare November 6, 2024 16:25
@czentgr
Collaborator Author

czentgr commented Nov 6, 2024

@mbasmanova Odd that you see a conflict. "Add gcc11 to Ubuntu20.04 setup and add PkgConfig install" did touch setup-ubuntu.sh, but in different places, and should not cause a conflict. I've rebased, so hopefully your conflict is gone.

@facebook-github-bot
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@mbasmanova
Contributor

@czentgr There is a Spark Fuzzer failure. Would you take a look?

CC: @rui-mo

@czentgr
Collaborator Author

czentgr commented Nov 6, 2024

This change should not affect the fuzzer. I created an issue: #11462
In it there is a reference to an issue that was closed earlier with the same error. Maybe it resurfaced for some reason?

@mbasmanova
Contributor

@czentgr Thank you for creating an issue. I'll try to bypass this failure and merge.

@mbasmanova
Contributor

@czentgr Something happened and the land got stuck. Would you rebase so I can try again?

@czentgr
Collaborator Author

czentgr commented Nov 13, 2024

@mbasmanova Thanks! I rebased.

@mbasmanova
Contributor

@czentgr Thanks. There are Fuzzer failures. Would you take a look?

@facebook-github-bot
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@mbasmanova merged this pull request in 8374802.

@czentgr
Collaborator Author

czentgr commented Nov 14, 2024

@mbasmanova The fuzzer failure was an intermittent network issue. It failed to upload artifacts:

With the provided path, there will be 6 files uploaded
Artifact name is valid!
Root directory input is valid!
Attempt 1 of 5 failed with error: Request timeout: /twirp/github.actions.results.api.v1.ArtifactService/CreateArtifact. Retrying request in 3000 ms...
Attempt 2 of 5 failed with error: Request timeout: /twirp/github.actions.results.api.v1.ArtifactService/CreateArtifact. Retrying request in 5792 ms...
Attempt 3 of 5 failed with error: Request timeout: /twirp/github.actions.results.api.v1.ArtifactService/CreateArtifact. Retrying request in 9578 ms...
Attempt 4 of 5 failed with error: Request timeout: /twirp/github.actions.results.api.v1.ArtifactService/CreateArtifact. Retrying request in 10563 ms...
Error: Failed to CreateArtifact: Failed to make request after 5 attempts: Request timeout: /twirp/github.actions.results.api.v1.ArtifactService/CreateArtifact

The fuzzer run itself was successful. I think we are good and you merged the PR :)


Conbench analyzed the 1 benchmark run on commit 83748024.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.
