Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-35089: [CI][C++][Flight] Test failures in macos release verification nightlies #35090

Merged
merged 5 commits into from
Apr 16, 2023

Conversation

zeroshade
Copy link
Member

@zeroshade zeroshade commented Apr 12, 2023

@zeroshade zeroshade requested review from raulcd and assignUser April 12, 2023 22:13
@zeroshade zeroshade requested a review from lidavidm as a code owner April 12, 2023 22:13
@github-actions
Copy link

@github-actions
Copy link

⚠️ GitHub issue #35089 has been automatically assigned in GitHub to PR creator.

@zeroshade
Copy link
Member Author

@github-actions crossbow submit verify-rc-source-cpp-macos-*

@github-actions github-actions bot added the awaiting committer review Awaiting committer review label Apr 12, 2023
@github-actions
Copy link

Revision: 01d81d6

Submitted crossbow builds: ursacomputing/crossbow @ actions-6610cb8aee

Task Status
verify-rc-source-cpp-macos-amd64 Github Actions
verify-rc-source-cpp-macos-arm64 Github Actions
verify-rc-source-cpp-macos-conda-amd64 Github Actions

@zeroshade
Copy link
Member Author

@github-actions crossbow submit verify-rc-source-cpp-macos-arm64

@github-actions
Copy link

Revision: 76d6522

Submitted crossbow builds: ursacomputing/crossbow @ actions-73cf368dff

Task Status
verify-rc-source-cpp-macos-arm64 Github Actions

cpp/src/arrow/flight/flight_test.cc Outdated Show resolved Hide resolved
cpp/src/arrow/flight/test_definitions.cc Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Apr 12, 2023
@zeroshade zeroshade force-pushed the macos-release-verification branch from 76d6522 to 3ab5a0f Compare April 13, 2023 18:39
@zeroshade
Copy link
Member Author

@github-actions crossbow submit verify-rc-source-cpp-macos-*

@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Apr 13, 2023
@github-actions
Copy link

Revision: 3ab5a0f

Submitted crossbow builds: ursacomputing/crossbow @ actions-d506a04ac8

Task Status
verify-rc-source-cpp-macos-amd64 Github Actions
verify-rc-source-cpp-macos-arm64 Github Actions
verify-rc-source-cpp-macos-conda-amd64 Github Actions

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Apr 13, 2023
@zeroshade
Copy link
Member Author

I've reduced the failures at least, but I can't seem to figure out the cause of these Segfaults in the macos-amd64-conda release verification. Any assistance here would be amazing. Thanks!

@zeroshade zeroshade added the Priority: Blocker Marks a blocker for the release label Apr 13, 2023
@lidavidm
Copy link
Member

Weston spun up a macOS environment that I plan to look into tonight

@lidavidm
Copy link
Member

However we can/should get this merged

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Apr 14, 2023
@lidavidm
Copy link
Member

Ok, I checked out the instance that Weston set up. (Thanks a lot Weston!)

The test fails probabilistically if I just try to run GrpcDataTest.TestDoGetInts. The backtrace shows this suspicious behavior: The main thread is executing a gRPC-internal destructor, ExecCtx::~ExecCtx:

  thread #1, queue = 'com.apple.main-thread'
    frame #0: 0x00007ff800bcb889 libsystem_pthread.dylib`pthread_mutex_lock + 11
    frame #1: 0x00000001002af7b9 libgpr.29.dylib`gpr_mu_lock + 9
    frame #2: 0x000000010123390f libgrpc.29.dylib`fd_orphan(grpc_fd*, grpc_closure*, int*, char const*) + 255
    frame #3: 0x0000000101249487 libgrpc.29.dylib`deactivated_all_ports(grpc_tcp_server*) + 135
    frame #4: 0x0000000101249195 libgrpc.29.dylib`on_read(void*, absl::lts_20230125::Status) + 1477
    frame #5: 0x000000010123794c libgrpc.29.dylib`grpc_core::ExecCtx::Flush() + 156
    frame #6: 0x0000000101034f11 libgrpc.29.dylib`grpc_core::ExecCtx::~ExecCtx() + 33
    frame #7: 0x00000001012e9110 libgrpc.29.dylib`grpc_server_shutdown_and_notify + 288
    frame #8: 0x0000000100a441bc libgrpc++.1.51.dylib`grpc::Server::ShutdownInternal(gpr_timespec) + 236
    frame #9: 0x00000001008d249f libarrow_flight.1200.dylib`arrow::flight::transport::grpc::(anonymous namespace)::GrpcServerTransport::Shutdown() + 47

Meanwhile, one of the gRPC server threads is executing the same method:

* thread #19, stop reason = EXC_BAD_ACCESS (code=1, address=0x9)
  * frame #0: 0x0000000101222f64 libgrpc.29.dylib`grpc_core::StatusGetChildren(absl::lts_20230125::Status) + 68
    frame #1: 0x00000001012eeb8d libgrpc.29.dylib`grpc_error_has_clear_grpc_status(absl::lts_20230125::Status) + 125
    frame #2: 0x000000010113142c libgrpc.29.dylib`close_transport_locked(grpc_chttp2_transport*, absl::lts_20230125::Status) + 92
    frame #3: 0x0000000101135173 libgrpc.29.dylib`read_action_locked(void*, absl::lts_20230125::Status) + 1491
    frame #4: 0x0000000101231afe libgrpc.29.dylib`grpc_combiner_continue_exec_ctx() + 158
    frame #5: 0x00000001012378fe libgrpc.29.dylib`grpc_core::ExecCtx::Flush() + 78
    frame #6: 0x0000000101234a06 libgrpc.29.dylib`pollset_work(grpc_pollset*, grpc_pollset_worker**, grpc_core::Timestamp) + 2662
    frame #7: 0x0000000101236f86 libgrpc.29.dylib`pollset_work(grpc_pollset*, grpc_pollset_worker**, grpc_core::Timestamp) + 22
    frame #8: 0x000000010123ad23 libgrpc.29.dylib`grpc_pollset_work(grpc_pollset*, grpc_pollset_worker**, grpc_core::Timestamp) + 19
    frame #9: 0x00000001012dedb0 libgrpc.29.dylib`cq_next(grpc_completion_queue*, gpr_timespec, void*) + 544
    frame #10: 0x0000000100a29b00 libgrpc++.1.51.dylib`grpc::CompletionQueue::AsyncNextInternal(void**, bool*, gpr_timespec) + 80
    frame #11: 0x0000000100a45655 libgrpc++.1.51.dylib`grpc::Server::SyncRequestThreadManager::PollForWork(void**, bool*) + 101
    frame #12: 0x0000000100a50700 libgrpc++.1.51.dylib`grpc::ThreadManager::MainWorkLoop() + 64

That said, this is supposed to be a thread local, so I don't see how they'd trample each other here. I also can't get lldb to print out the thread-local value so I can't check if it's initialized or not.

@lidavidm
Copy link
Member

Ah, this gets lldb to recognize it:

(lldb) image lookup -v -s grpc_core::ExecCtx::exec_ctx_ libgrpc.29.dylib
1 symbols match 'grpc_core::ExecCtx::exec_ctx_' in /private/tmp/verify-tmp/mambaforge/envs/conda-source/lib/libgrpc.29.dylib:
        Address: libgrpc.29.dylib[0x00000000003db9b8] (libgrpc.29.dylib.__DATA.__thread_vars + 264)
        Summary: libgrpc.29.dylib`grpc_core::ExecCtx::exec_ctx_
         Module: file = "/private/tmp/verify-tmp/mambaforge/envs/conda-source/lib/libgrpc.29.dylib", arch = "x86_64"
         Symbol: id = {0x000036f2}, range = [0x00000001014019b8-0x00000001014019d0), name="grpc_core::ExecCtx::exec_ctx_", mangled="_ZN9grpc_core7ExecCtx9exec_ctx_E"

(lldb) print _ZN9grpc_core7ExecCtx9exec_ctx_E
(void *) $0 = 0x00007ff800bd743c

lldb says they're the same:

(lldb) t 1
* thread #1, queue = 'com.apple.main-thread'
    frame #0: 0x00007ff800bcb889 libsystem_pthread.dylib`pthread_mutex_lock + 11
libsystem_pthread.dylib`pthread_mutex_lock:
->  0x7ff800bcb889 <+11>: jne    0x7ff800bcb8fe            ; <+128>
    0x7ff800bcb88b <+13>: movl   0xc(%rdi), %eax
    0x7ff800bcb88e <+16>: movl   %eax, %ecx
    0x7ff800bcb890 <+18>: andl   $0x1c0, %ecx              ; imm = 0x1C0
(lldb) print _ZN9grpc_core7ExecCtx9exec_ctx_E
(void *) $1 = 0x00007ff800bd743c

But I wonder if it recognizes thread locals properly.

That said, this seems to mostly be a gRPC problem...

@lidavidm
Copy link
Member

I do see that grpc-cpp is 1.51.1 in the env and in the crashing CI build. (The brew CI build fails, but doesn't crash.) But conda-forge has 1.52 now - maybe we should try that?

@lidavidm
Copy link
Member

Aha, and we were artificially on an old version because the conda package was renamed grpc-cpp to libgrpc.

@lidavidm
Copy link
Member

Unfortunately, gRPC 1.52.1 has the same issue. Illegal keepalive throttling value also starts appearing. The stack traces look the same.

@lidavidm
Copy link
Member

Correction: the main thread has a different stack trace, so it seems less likely that two threads are trampling on ExecCtx and more that there's something being tickled in gRPC:

* thread #1, queue = 'com.apple.main-thread'
  * frame #0: 0x0000000101c4c760 libgrpc.30.dylib`grpc_tcp_server_unref(grpc_tcp_server*)
    frame #1: 0x0000000101b2205e libgrpc.30.dylib`grpc_core::(anonymous namespace)::Chttp2ServerListener::Orphan() + 286
    frame #2: 0x0000000101ceb844 libgrpc.30.dylib`grpc_core::Server::ShutdownAndNotify(grpc_completion_queue*, void*) + 820
    frame #3: 0x0000000101cee7e4 libgrpc.30.dylib`grpc_server_shutdown_and_notify + 276
    frame #4: 0x0000000100daa9cc libgrpc++.1.52.dylib`grpc::Server::ShutdownInternal(gpr_timespec) + 220
    frame #5: 0x00000001012706d1 libarrow_flight.1200.dylib`arrow::flight::transport::grpc::(anonymous namespace)::GrpcServerTransport::Shutdown() + 33
    frame #6: 0x0000000101250755 libarrow_flight.1200.dylib`arrow::flight::FlightServerBase::Shutdown(std::__1::chrono::time_point<std::__1::chrono::system_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000l> > > const*) + 149
    frame #7: 0x0000000100396131 libarrow_flight_testing.1200.dylib`arrow::flight::DataTest::TearDownTest() + 129

@lidavidm
Copy link
Member

I suppose:

  • is disabling Flight in this configuration for the release an option?
  • next step would be to try to build gRPC with debug info and see if we can debug it more thoroughly (but I'm not sure debugging gRPC is great...)
  • another step would be to pin an old gRPC version and hope that suffices
  • we could try to play stupid games (e.g., insert a sleep for macOS and hope that gives gRPC enough time to settle down and avoid whatever apparent race condition, or have server shutdown be a no-op in macOS and hope that's enough)
  • do we know when this first started happening? Is there a potential environment/dependency change or commit that we can use to start investigating?
  • I posted this in the gRPC gitter chat to see if they had any ideas

@zeroshade
Copy link
Member Author

@lidavidm When I followed the crossbow builds back I can't see an environmental change, the last successful commit before this started failing was f2d632e...1d74483 on Feb 20th (which is before the last update to grpc-cpp in conda).

I'm currently building grpc from source on the macos instance that @westonpace spun up to compare against the conda env. If i'm able to figure out anything I'll comment back on here.

@zeroshade
Copy link
Member Author

@lidavidm I was able to reproduce these crashes by building grpc v1.51.1 from source on the macos box that @westonpace spun up while keeping the rest of the dependencies coming from Conda.

In addition, weston found this: conda-forge/grpc-cpp-feedstock#281 which could likely explain the issue (assuming we're not missing something). I'm currently building v1.54.0 from source to confirm for myself that I didn't screw any environmental stuff up, but it looks like this explains the problem. I can try pinning conda to v1.50.0 or something instead and seeing if that alleviates the issue afterwards.

@zeroshade zeroshade requested a review from kou as a code owner April 14, 2023 21:52
@zeroshade
Copy link
Member Author

@github-actions crossbow submit verify-rc-source-cpp-macos-conda-amd64

@github-actions
Copy link

Revision: 4ef0c66

Submitted crossbow builds: ursacomputing/crossbow @ actions-18a35681a6

Task Status
verify-rc-source-cpp-macos-conda-amd64 Github Actions

@zeroshade
Copy link
Member Author

Pinning the conda version of grpc-cpp to <=1.50.1 looks like it worked on the macosx ec2 box. So here's hoping it works for crossbow also, if so we can call this done and set! 😄

@lidavidm
Copy link
Member

@lidavidm I was able to reproduce these crashes by building grpc v1.51.1 from source on the macos box that @westonpace spun up while keeping the rest of the dependencies coming from Conda.

In addition, weston found this: conda-forge/grpc-cpp-feedstock#281 which could likely explain the issue (assuming we're not missing something). I'm currently building v1.54.0 from source to confirm for myself that I didn't screw any environmental stuff up, but it looks like this explains the problem. I can try pinning conda to v1.50.0 or something instead and seeing if that alleviates the issue afterwards.

Wow, that's an incredible find. Thanks for figuring this out!

@assignUser
Copy link
Member

Wow, amazing job!

@zeroshade zeroshade merged commit 2daa0c3 into apache:main Apr 16, 2023
@zeroshade zeroshade deleted the macos-release-verification branch April 16, 2023 17:37
raulcd pushed a commit that referenced this pull request Apr 17, 2023
…on nightlies (#35090)

* Closes: #35089

Authored-by: Matt Topol <[email protected]>
Signed-off-by: Matthew Topol <[email protected]>
@ursabot
Copy link

ursabot commented Apr 18, 2023

Benchmark runs are scheduled for baseline = 3ff3cc8 and contender = 2daa0c3. 2daa0c3 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.12% ⬆️0.09%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 2daa0c3e ec2-t3-xlarge-us-east-2
[Failed] 2daa0c3e test-mac-arm
[Finished] 2daa0c3e ursa-i9-9960x
[Finished] 2daa0c3e ursa-thinkcentre-m75q
[Finished] 3ff3cc85 ec2-t3-xlarge-us-east-2
[Failed] 3ff3cc85 test-mac-arm
[Finished] 3ff3cc85 ursa-i9-9960x
[Finished] 3ff3cc85 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

liujiacheng777 pushed a commit to LoongArch-Python/arrow that referenced this pull request May 11, 2023
…fication nightlies (apache#35090)

* Closes: apache#35089

Authored-by: Matt Topol <[email protected]>
Signed-off-by: Matthew Topol <[email protected]>
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this pull request May 15, 2023
…fication nightlies (apache#35090)

* Closes: apache#35089

Authored-by: Matt Topol <[email protected]>
Signed-off-by: Matthew Topol <[email protected]>
rtpsw pushed a commit to rtpsw/arrow that referenced this pull request May 16, 2023
…fication nightlies (apache#35090)

* Closes: apache#35089

Authored-by: Matt Topol <[email protected]>
Signed-off-by: Matthew Topol <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CI][C++][Flight] Test failures in macos release verification nightlies
4 participants