Frequent (but nondeterministic) failures in test_profiling in Frontier CI #1795

Open
elliottslaughter opened this issue Nov 25, 2024 · 0 comments
Labels
Realm Issues pertaining to Realm

Comments

@elliottslaughter
Contributor

As of master commit a97cfa564afe8050ae5c455109bc4ac774916e66, I have started seeing frequent, but nondeterministic, failures in test_profiling on Frontier CI. (Note: I don't believe this is related to that specific commit; it's just the one where I first saw the failures occur.)

Sample failures: 1, 2, 3, 4, 5

Sample success: 1

Failure output looks like:

 58/109 Test  #58: test_profiling ...................***Failed   13.73 sec
top level task - getting machine and list of CPUs
[0 - 7fffe588e340]    0.159977 {3}{app}: profiling response task on processor 1d00000000000001
got profiling response - 136 bytes
Bytes: 04 00 00 00 00 00 00 00 03 00 00 00 04 00 00 00 13 00 00 00 28 00 00 00 38 00 00 00 60 00 00 00 68 00 00 00 78 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 99 49 8c 03 00 00 00 00 33 4d 8c 03 00 00 00 00 5e a2 8d 03 00 00 00 00 6d 7c 85 09 00 00 00 00 f6 cd 87 09 00 00 00 00 00 00 00 00 00 00 00 00 1c c2 8d 03 00 00 00 00 00 00 00 00 00 00 00 80 01 00 00 00 00 00 00 00 02 00 00 00 00 03 00 1d
op status = 0 (code = 0, details = 0 bytes)
op timeline = 59526451 59613790 159743085 159895030 (87339 100129295 151945)
op gpu timeline = 59621916 -9223372036854775808 (9223372036795153892)
test_profiling: /lustre/orion/ums036/proj-shared/ci/38148_79942/test/realm/test_profiling.cc:196: void response_task(const void*, size_t, const void*, size_t, Realm::Processor): Assertion `(result == OperationStatus::TERMINATED_EARLY) || (op_timeline->start_time <= op_timeline->end_time)' failed.
Signal 6 received by node 0, process 2489157 (thread 7fffe588e340) - obtaining backtrace
Signal 6 received by process 2489157 (thread 7fffe588e340) at: stack trace: 11 frames
  [0] = /lib64/libc.so.6(gsignal+0x10d) [0x7ffff3c5ed2b]
  [1] = /lib64/libc.so.6(abort+0x176) [0x7ffff3c603e4]
  [2] = /lib64/libc.so.6(+0x42c69) [0x7ffff3c56c69]
  [3] = /lib64/libc.so.6(__assert_fail+0x43) [0x7ffff3c56cf1]
  [4] = /lustre/orion/ums036/proj-shared/ci/38148_79942/tmpuoldqey0/build/bin/test_profiling(response_task(void const*, unsigned long, void const*, unsigned long, Realm::Processor)+0xe8e) [0x48aaae]
  [5] = /lustre/orion/ums036/proj-shared/ci/38148_79942/tmpuoldqey0/build/bin/test_profiling() [0x68dc18]
  [6] = /lustre/orion/ums036/proj-shared/ci/38148_79942/tmpuoldqey0/build/bin/test_profiling() [0x68dcb5]
  [7] = /lustre/orion/ums036/proj-shared/ci/38148_79942/tmpuoldqey0/build/bin/test_profiling() [0x68c12e]
  [8] = /lustre/orion/ums036/proj-shared/ci/38148_79942/tmpuoldqey0/build/bin/test_profiling() [0x6930e9]
  [9] = /lib64/libc.so.6(+0x6179d) [0x7ffff3c7579d]
  [10] = [(nil)]
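
One detail in the output above can be checked by hand: the second value on the `op gpu timeline` line, -9223372036854775808, is the minimum `int64_t`, and the parenthesized number after it is what `end - start` wraps to in 64-bit arithmetic. The sketch below reproduces that arithmetic and decodes the matching bytes from the `Bytes:` dump. It is only an illustration: the helper names (`decode_le64`, `wrapped_diff`, `timeline_ok`, `check_log_values`) are hypothetical and not part of Realm, and it is an assumption, not something the log confirms, that this start/end pair is the one the failed assertion compares or that the minimum value marks an end time that was never recorded.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Decode 8 little-endian bytes, in the order they appear in the "Bytes:" dump.
constexpr std::uint64_t decode_le64(const unsigned char (&b)[8]) {
  std::uint64_t v = 0;
  for (int i = 7; i >= 0; --i) v = (v << 8) | b[i];
  return v;
}

// end - start with wraparound; unsigned arithmetic is defined modulo 2^64,
// so this avoids signed overflow while giving the same bit pattern.
constexpr std::uint64_t wrapped_diff(std::int64_t end, std::int64_t start) {
  return static_cast<std::uint64_t>(end) - static_cast<std::uint64_t>(start);
}

// Hypothetical helper with the same shape as the condition in the failed assertion.
constexpr bool timeline_ok(bool terminated_early, std::int64_t start, std::int64_t end) {
  return terminated_early || (start <= end);
}

// Values copied from the "op gpu timeline" line of the log.
constexpr std::int64_t kGpuStart = 59621916;
constexpr std::int64_t kInt64Min = std::numeric_limits<std::int64_t>::min();

// The two 8-byte groups from the "Bytes:" line that carry those values.
constexpr unsigned char kGpuStartBytes[8] = {0x1c, 0xc2, 0x8d, 0x03, 0x00, 0x00, 0x00, 0x00};
constexpr unsigned char kGpuEndBytes[8] = {0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x80};

// Runtime checks tying the log's numbers together.
inline void check_log_values() {
  // The byte dump and the printed gpu timeline agree.
  assert(decode_le64(kGpuStartBytes) == static_cast<std::uint64_t>(kGpuStart));
  assert(decode_le64(kGpuEndBytes) == static_cast<std::uint64_t>(kInt64Min));
  // The parenthesized duration is the wrapped difference end - start.
  assert(wrapped_diff(kInt64Min, kGpuStart) == 9223372036795153892ULL);
  // A start <= end check fails for this pair (assumed, not confirmed, to be
  // the pair the test inspects).
  assert(!timeline_ok(false, kGpuStart, kInt64Min));
  // The same check passes for adjacent values on the "op timeline" line.
  assert(timeline_ok(false, 59613790, 159743085));
}
```

Calling `check_log_values()` from a `main` completes without tripping any assertion, which is consistent with the values on the `op timeline` line being ordered while the gpu pair has an end time of minimum `int64_t`.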

This is now breaking all of our Frontier CI builds, so it would be nice if we could get this fixed.

@eddy16112 @apryakhin

@eddy16112 eddy16112 added the Realm Issues pertaining to Realm label Nov 25, 2024