Trilinos_PR_cuda-11.4.2-uvm-off PR build not running/submitted to CDash starting 2024-01-24 #12696

Open
bartlettroscoe opened this issue Jan 25, 2024 · 7 comments

Comments

@bartlettroscoe (Member) commented Jan 25, 2024

CC: @trilinos/framework, @sebrowne, @achauphan

Description

As shown in this query, the Trilinos PR build Trilinos_PR_cuda-11.4.2-uvm-off has not posted full results to CDash since early yesterday (2024-01-24):

[screenshot: CDash query results]

Yet many PR iterations have run and posted to CDash in that time, as shown in this query for the Trilinos_PR_clang-11.0.1 PR build, for example:

[screenshot: CDash query results]

That is a bunch of PRs that are not passing their PR test iterations and will not be getting merged. (This also explains why it took so long for the autotester to run on my new PR #12695.)

Looks like this has so far impacted the PRs:

bartlettroscoe pinned this issue Jan 25, 2024
@bartlettroscoe (Member, Author)

@ccober6 and @sebrowne, this would appear to be a catastrophic failure of the PR testing system.

@achauphan (Contributor)

It appears that all PR GPU machines are down. @sebrowne has a ticket in.

@bartlettroscoe (Member, Author)

> It appears that all PR GPU machines are down. @sebrowne has a ticket in.

@sebrowne and @achauphan, given the importance of these PR build machines, is it possible to set up some type of monitoring system for them so that if they go down, someone is notified ASAP? There must be monitoring tools that can do this type of thing. (Or Jenkins should be able to do this itself since it is trying to run these jobs, perhaps with the right Jenkins plugin as described here?) I know all of this autotester and Jenkins machinery is going to be thrown away once Trilinos moves to GHA, but the same issues can occur with that process as well.

The problem right now is that when something goes wrong with the Trilinos infrastructure, it is Trilinos developers that have to detect and report the problem. Problems with the infrastructure will occur from time to time (that is to be expected), but when they do, it would be good if the people maintaining the infrastructure could be directly notified and not have to rely on the Trilinos developers to detect and report problems like this.
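
For reference, the Jenkins REST API already exposes node status, so even without a dedicated plugin a small cron job could poll it and alert. A minimal sketch of what that could look like, assuming the standard `/computer/api/json` endpoint; the URL, credentials, and notification hook below are placeholders, not our actual setup:

```python
import requests

JENKINS_URL = "https://jenkins.example.com"  # placeholder for the real Jenkins URL
AUTH = ("monitor-user", "api-token")         # placeholder credentials (user, API token)

def offline_nodes():
    """Return (name, reason) for every Jenkins agent currently marked offline."""
    resp = requests.get(
        f"{JENKINS_URL}/computer/api/json",
        params={"tree": "computer[displayName,offline,offlineCauseReason]"},
        auth=AUTH,
        timeout=30,
    )
    resp.raise_for_status()
    return [
        (c["displayName"], c.get("offlineCauseReason") or "unknown")
        for c in resp.json()["computer"]
        if c.get("offline")
    ]

if __name__ == "__main__":
    for name, reason in offline_nodes():
        # Replace print() with whatever notification hook is available (email, Slack, pager, etc.).
        print(f"ALERT: Jenkins node '{name}' is offline ({reason})")
```

Run from cron every few minutes, this would have flagged the GPU nodes going down without anyone having to notice missing CDash results first.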

@achauphan (Contributor)

> [...] it would be good if the people maintaining the infrastructure could be directly notified and not have to rely on the Trilinos developers to detect and report problems like this.

Agreed, I will bring this up at our retro next week to see if there is a reasonable solution we can set up in the interim before AT2. Currently, Jenkins does send an email when a node goes offline, which I had missed.

@achauphan (Contributor)

As a status update, all GPU nodes were brought back online this morning. One node has been manually taken offline again because it was showing very odd, poor performance and had picked up the first few jobs this morning.

@bartlettroscoe (Member, Author) commented Jan 26, 2024

FYI: It is not just the CUDA build that has failed to produce PR testing results on CDash. The Trilinos_PR_gcc-8.3.0-debug build rhel7_sems-gnu-8.3.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables is affected as well. See #12695 (comment). However, one of these builds just started 37 minutes ago, so it is not clear how serious this problem is.

NOTE: It would also be great to set up monitoring of CDash that looks for missing PR build results. That is similar to looking for randomly failing tests (see TriBITSPub/TriBITS#600) but does not require knowing the repo versions; however, it is complicated by the challenge of grouping builds on CDash that belong to the same PR testing iteration (because all you have to go on is the Build Start Time, which differs between builds but is typically within a few minutes for builds of the same iteration). I suggested this in this internal comment.
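
A minimal sketch of what such a CDash check could look like, assuming the standard CDash v1 index API (api/v1/index.php?project=...&date=...); it only verifies that at least one build matching the expected PR build name was posted today, without attempting the harder problem of grouping builds into PR iterations noted above. The server URL is a placeholder:

```python
import datetime
import requests

CDASH_URL = "https://my.cdash.example.com"        # placeholder for the CDash server URL
PROJECT = "Trilinos"
EXPECTED_BUILD_SUBSTRING = "cuda-11.4.2-uvm-off"  # PR build name fragment to watch for

def build_names_for(date):
    """Return the names of all builds posted to the project's dashboard for the given date."""
    resp = requests.get(
        f"{CDASH_URL}/api/v1/index.php",
        params={"project": PROJECT, "date": date.isoformat()},
        timeout=60,
    )
    resp.raise_for_status()
    names = []
    for group in resp.json().get("buildgroups", []):
        names.extend(b["buildname"] for b in group.get("builds", []))
    return names

if __name__ == "__main__":
    today = datetime.date.today()
    names = build_names_for(today)
    if not any(EXPECTED_BUILD_SUBSTRING in n for n in names):
        # Replace print() with a real notification hook.
        print(f"ALERT: no '{EXPECTED_BUILD_SUBSTRING}' builds posted to CDash on {today}")
```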

@ndellingwood (Contributor)

PR #12707 has been hit with a couple of issues in the gcc build mentioned above

  • rhel7_sems-gnu-8.3.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables

as well as this build

  • rhel7_sems-gnu-8.3.0-openmpi-1.10.1-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables

#12707 (comment)

That was on ascic166; the node ran out of memory.

sebrowne unpinned this issue Mar 14, 2024