Trilinos_PR_cuda-11.4.2-uvm-off PR build not running/submitted to CDash starting 2024-01-24 #12696

Open
bartlettroscoe opened this issue Jan 25, 2024 · 7 comments

Comments

@bartlettroscoe (Member) commented Jan 25, 2024

CC: @trilinos/framework, @sebrowne, @achauphan

Description

As shown in this query, the Trilinos PR build Trilinos_PR_cuda-11.4.2-uvm-off has not posted full results to CDash since early yesterday (2024-01-24):

[screenshot: CDash query results]

Yet many PR iterations have run and posted to CDash in that time, as shown in this query for the Trilinos_PR_clang-11.0.1 PR build, for example:

[screenshot: CDash query results]

That is a bunch of PRs that are not passing their PR test iterations and will not be getting merged. (This also explains why it took so long for the autotester to run on my new PR #12695.)

Looks like this has so far impacted the PRs:

bartlettroscoe pinned this issue Jan 25, 2024
@bartlettroscoe (Member, Author)

@ccober6 and @sebrowne, this would appear to be a catastrophic failure of the PR testing system.

@achauphan (Contributor)

It appears that all PR GPU machines are down. @sebrowne has a ticket in.

@bartlettroscoe (Member, Author)

> It appears that all PR GPU machines are down. @sebrowne has a ticket in.

@sebrowne and @achauphan, given the importance of these PR build machines, is it possible to set up some type of monitoring system for them so that if they go down, someone is notified ASAP? There must be monitoring tools that can do this type of thing. (Or Jenkins should be able to do this itself since it is trying to run these jobs, perhaps with the right Jenkins plugin as described here?) I know all of this autotester and Jenkins machinery is going to be thrown away once Trilinos moves to GHA, but the same issues can occur with that process as well.

The problem right now is that when something goes wrong with the Trilinos infrastructure, it is Trilinos developers that have to detect and report the problem. Problems with the infrastructure will occur from time to time (that is to be expected), but when they do, it would be good if the people maintaining the infrastructure could be directly notified and not have to rely on the Trilinos developers to detect and report problems like this.
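
For reference, the Jenkins REST API already exposes node status, so even without a dedicated plugin a small cron job could poll it and alert. A minimal sketch of what that could look like, assuming the standard `/computer/api/json` endpoint; the URL, credentials, and notification hook below are placeholders, not our actual setup:

```python
import requests

JENKINS_URL = "https://jenkins.example.com"  # placeholder for the real Jenkins URL
AUTH = ("monitor-user", "api-token")         # placeholder credentials (user, API token)

def offline_nodes():
    """Return (name, reason) for every Jenkins agent currently marked offline."""
    resp = requests.get(
        f"{JENKINS_URL}/computer/api/json",
        params={"tree": "computer[displayName,offline,offlineCauseReason]"},
        auth=AUTH,
        timeout=30,
    )
    resp.raise_for_status()
    return [
        (c["displayName"], c.get("offlineCauseReason") or "unknown")
        for c in resp.json()["computer"]
        if c.get("offline")
    ]

if __name__ == "__main__":
    for name, reason in offline_nodes():
        # Replace print() with whatever notification hook is available (email, Slack, pager, etc.).
        print(f"ALERT: Jenkins node '{name}' is offline ({reason})")
```

Run from cron every few minutes, this would have flagged the GPU nodes going down without anyone having to notice missing CDash results first.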

@achauphan (Contributor)

> [...] it would be good if the people maintaining the infrastructure could be directly notified and not have to rely on the Trilinos developers to detect and report problems like this.

Agreed, I will bring this up at our retro next week to see if there is a reasonable solution we can set up in the interim before AT2. Currently, Jenkins does send an email when a node goes offline, which I had missed.

@achauphan (Contributor)

As a status update, all GPU nodes were brought back online this morning. One node has been manually taken offline again because it was showing very odd, poor performance and had picked up the first few jobs this morning.

@bartlettroscoe (Member, Author) commented Jan 26, 2024

FYI: It is not just the CUDA build that has failed to produce PR testing results on CDash. The Trilinos_PR_gcc-8.3.0-debug build rhel7_sems-gnu-8.3.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables is affected as well. See #12695 (comment). However, one of these builds just started 37 minutes ago, so it is not clear how serious this problem is.

NOTE: It would also be great to set up monitoring of CDash that looks for missing PR build results. That is similar to looking for randomly failing tests (see TriBITSPub/TriBITS#600) but does not require knowing the repo versions; however, it is complicated by the challenge of grouping builds on CDash that belong to the same PR testing iteration (because all you have to go on is the Build Start Time, which differs between builds but is typically within a few minutes for builds of the same iteration). I suggested this in this internal comment.
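
A minimal sketch of what such a CDash check could look like, assuming the standard CDash v1 index API (api/v1/index.php?project=...&date=...); it only verifies that at least one build matching the expected PR build name was posted today, without attempting the harder problem of grouping builds into PR iterations noted above. The server URL is a placeholder:

```python
import datetime
import requests

CDASH_URL = "https://my.cdash.example.com"        # placeholder for the CDash server URL
PROJECT = "Trilinos"
EXPECTED_BUILD_SUBSTRING = "cuda-11.4.2-uvm-off"  # PR build name fragment to watch for

def build_names_for(date):
    """Return the names of all builds posted to the project's dashboard for the given date."""
    resp = requests.get(
        f"{CDASH_URL}/api/v1/index.php",
        params={"project": PROJECT, "date": date.isoformat()},
        timeout=60,
    )
    resp.raise_for_status()
    names = []
    for group in resp.json().get("buildgroups", []):
        names.extend(b["buildname"] for b in group.get("builds", []))
    return names

if __name__ == "__main__":
    today = datetime.date.today()
    names = build_names_for(today)
    if not any(EXPECTED_BUILD_SUBSTRING in n for n in names):
        # Replace print() with a real notification hook.
        print(f"ALERT: no '{EXPECTED_BUILD_SUBSTRING}' builds posted to CDash on {today}")
```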

@ndellingwood (Contributor)

PR #12707 has been hit with a couple of issues in the gcc build mentioned above

  • rhel7_sems-gnu-8.3.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables

as well as this build

  • rhel7_sems-gnu-8.3.0-openmpi-1.10.1-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables

#12707 (comment)

That was on ascic166; the node ran out of memory.

sebrowne unpinned this issue Mar 14, 2024