stat-core-merger stuck communicating with gdb #35

Open
roblatham00 opened this issue Feb 17, 2022 · 5 comments

@roblatham00
Contributor

Platform: OLCF Summit
Versions: STAT from spack: spack install stat%gcc@10.2.0 cxxflags=--std=c++14

==> 1 installed package
-- linux-rhel8-power9le / gcc@10.2.0 ----------------------------
wn2frxd stat@4.1.0%gcc  cxxflags="--std=c++14" ~dysect~examples~fgfs~gui
3ra646m     [email protected]%gcc  cxxflags="--std=c++14" +atomic+chrono~clanglibcpp~container~context~coroutine+date_time~debug+exception~fiber+filesystem+graph~icu+iostreams+locale+log+math~mpi+multithreaded~numpy~pic+program_options~python+random+regex+serialization+shared+signals~singlethreaded+system~taggedlayout+test+thread+timer~versionedlayout+wave cxxstd=98 patches=93f4aad8f88d1437e50d95a2d066390ef3753b99ef5de24f7a46bc083bd6df06 visibility=hidden
4ucshfz     dyninst@10.1.0%gcc  cxxflags="--std=c++14" ~ipo+openmp~stat_dysect~static build_type=RelWithDebInfo
bp7lk52         [email protected]%gcc  cxxflags="--std=c++14" ~bzip2~debuginfod+nls~xz
5kqtqyt         [email protected]%gcc  cxxflags="--std=c++14" ~ipo+shared+tm build_type=RelWithDebInfo cxxstd=default patches=62ba015ebd1819c45bef47411540b789b493e31ca668c4ff4cb2afcbc306b476,ce1fb16fb932ce86a82ca87cf0431d1a8c83652af9f552b264213b2ff2945d73,d62cb666de4010998c339cde6f41c7623a07e9fc69e498f2e149821c0c2c6dd0
qizwje7         [email protected]%gcc  cxxflags="--std=c++14" +pic
7lrjx2k     graphlib@3.0.0%gcc  cxxflags="--std=c++14" ~ipo build_type=RelWithDebInfo
j56c46j     graphviz@2.49.0%gcc  cxxflags="--std=c++14" ~doc~expat~ghostscript~gtkplus~gts~java~libgd~pangocairo~poppler~qt~quartz~x
7zttv3a         [email protected]%gcc  cxxflags="--std=c++14" +optimize+pic+shared
42awyk6     launchmon@master%gcc  cxxflags="--std=c++14"
ehifwhj         libgcrypt@1.9.3%gcc  cxxflags="--std=c++14"
nfkm5sn             libgpg-error@1.42%gcc  cxxflags="--std=c++14"
xkkejlv     mrnet@5.0.1-3%gcc  cxxflags="--std=c++14" ~lwthreads
cc2ohrr     [email protected]%gcc  cxxflags="--std=c++14" +bz2+ctypes+dbm~debug+libxml2+lzma+nis~optimizations+pic+pyexpat+pythoncmd+readline+shared+sqlite3+ssl~tix~tkinter~ucs4+uuid+zlib
p4xaimr     swig@4.0.2%gcc  cxxflags="--std=c++14"
kjydyg7         pcre@8.44%gcc  cxxflags="--std=c++14" ~jit+multibyte+utf

I was trying to collect/compare backtraces for ten core files with a command like this:

stat-core-merger -x =bedrock -F stdout -c /gpfs/alpine/csc332/scratch/${USER}/quintain-cores/

After fixing up Python's string/bytes challenges (maybe I goofed that!), the command hangs. Running with -L debug shows me:

115      core_file_merger:589   VERBOSE  (MainThread) Processing started at 2022-02-17 09:43:54.919282
merging 10 trace files 000%
115      core_file_merger:352   INFO     (MainThread) Connecting gdb to the core file (/gpfs/alpine/csc332/scratch/robl/quintain-cores//core.2)
1226     core_file_merger:379   DEBUG    (MainThread) Checking for gdb errors
1601     core_file_merger:427   DEBUG    (MainThread) Find a value for the current rank

When I check with ps I see STAT is trying to do this:

 gdb -ex set pagination 0 -ex cd /autofs/nccs-svm1_home1/robl/src/mochi-quintain/tests -ex path /autofs/nccs-svm1_home1/robl/src/mochi-quintain/tests -ex directory /autofs/nccs-svm1_home1/robl/src/mochi-quintain/tests -ex set filename-display absolute --core=/gpfs/alpine/csc332/scratch/robl/quintain-cores//core.2 /autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-11.1.0/mochi-bedrock-main-ibxscgvcko74xoyb6sv4lphuiv3deryo/bin/bedrock

and when I run that command myself, gdb suggests it did not process the command line arguments as expected:

%  gdb -ex set pagination 0 -ex cd /autofs/nccs-svm1_home1/robl/src/mochi-quintain/tests -ex path /autofs/nccs-svm1_home1/robl/src/mochi-quintain/tests -ex directory /autofs/nccs-svm1_home1/robl/src/mochi-quintain/tests -ex set filename-display absolute --core=/gpfs/alpine/csc332/scratch/robl/quintain-cores//core.2 /autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-11.1.0/mochi-bedrock-main-ibxscgvcko74xoyb6sv4lphuiv3deryo/bin/bedrock
Excess command line arguments ignored. (0 ...)
GNU gdb (GDB) 10.2
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "powerpc64le-unknown-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
pagination: No such file or directory.
[New LWP 150624]
Core was generated by `bedrock '.
Program terminated with signal SIGINT, Interrupt.
#0  0x0000200000b76118 in ?? ()
Argument required (expression to compute).
Working directory /ccs/home/robl
 (canonically /autofs/nccs-svm1_home1/robl).
Executable and object file path: /sw/summit/xalt/1.2.1/bin:/sw/sources/lsf-tools/2.0/summit/bin:/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-8.3.1/gdb-10.2-zl2qphcj4naoqsp6thilh4w5kkcf7n2u/bin:/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-10.2.0/stat-4.1.0-wn2frxd57sysvqvapa65yd5sqflvi3sr/bin:/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-10.2.0/swig-4.0.2-p4xaimrohrzqshwsefj7heh6f3df7bya/bin:/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-10.2.0/pcre-8.44-kjydyg7oxoimrh47ooejkj2jtv3uke3f/bin:/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-10.2.0/mrnet-5.0.1-3-xkkejlv2lt7xcsb65ga4thqntzrmoz3b/bin:/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-10.2.0/launchmon-master-42awyk6qtdhwgsen7k3bqldrdzc2es2o/bin:/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-10.2.0/libgcrypt-1.9.3-ehifwhjdwrb7tmapmkylstbqvp47gu62/bin:/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-10.2.0/libgpg-error-1.42-nfkm5snffx46qwffiwfngffnwsql2y6u/bin:/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-10.2.0/graphviz-2.49.0-j56c46j34im324olozfvvcmoslfphibq/bin:/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-10.2.0/graphlib-3.0.0-7lrjx2kdz5rg4e5g6t33gkzko7wfbm7n/bin:/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-10.2.0/dyninst-10.1.0-4ucshfzv5b574jurzctlbt7w3qxmgf2i/bin:/sw/summit/gcc/10.2.0-2/bin:/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-11.1.0/mochi-quintain-main-nkuuhxcrvm3irrqrxctkfysukzyb2xue/bin:/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-11.1.0/mochi-bedrock-main-ibxscgvcko74xoyb6sv4lphuiv3deryo/bin:/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-11.1.0/spectrum-mpi-10.4.0.3-20210112-6kg6anupjriji6pnvijebfn7ha5vsqp2/bin:/autofs/nccs-svm1_home1/robl/src/spack/op
t/spack/linux-rhel8-power9le/gcc-11.1.0/mochi-margo-main-bt67pbipf3q56ijgm2ij7nzjnlbvhruo/bin:/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-11.1.0/libfabric-1.13.2-hsk4mn4hjtnv7bnfptpzwhno4kjsqhvw/bin:/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-11.1.0/mochi-abt-io-0.5.1-ir7rmxlx4ebamktb7xtwo5iqyyzuum4d/bin:/sw/sources/hpss/bin:/autofs/nccs-svm1_home1/robl/src/spack/bin:/opt/ibm/csm/bin:/opt/ibm/spectrumcomputing/lsf/10.1.0.11/linux3.10-glibc2.17-ppc64le-csm/etc:/opt/ibm/spectrumcomputing/lsf/10.1.0.11/linux3.10-glibc2.17-ppc64le-csm/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/ibm/flightlog/bin:/opt/ibm/jsm/bin:/sw/sources/cgroup_tool/bin:/opt/puppetlabs/bin:/usr/lpp/mmfs/bin
Reinitialize source path to empty? (y or n)

In particular, note "pagination: No such file or directory" and "Excess command line arguments ignored".

If I re-run that command with all the -ex arguments quoted, gdb will give me the (gdb) prompt that the python script expects
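To illustrate the quoting point, here is a minimal sketch of building the gdb command as an argument list, so that each -ex command reaches gdb as a single argument. The helper names `build_gdb_argv` and `as_shell_string` are hypothetical and are not taken from core_file_merger.py; the paths in the test are shortened placeholders.

```python
import shlex


def build_gdb_argv(exe, core, commands):
    """Return an argv list where every gdb command is ONE argument after -ex.

    Hypothetical helper for illustration; not part of STAT.
    """
    argv = ["gdb"]
    for cmd in commands:
        # "set pagination 0" stays a single list element, so gdb does not
        # see "pagination" and "0" as stray positional arguments.
        argv += ["-ex", cmd]
    argv += ["--core=" + core, exe]
    return argv


def as_shell_string(argv):
    """Render the argv list as a copy-pasteable shell command with quoting."""
    return " ".join(shlex.quote(arg) for arg in argv)
```

Passing the list directly to `subprocess.Popen` (without `shell=True`) avoids shell word splitting altogether. Note also that `ps` prints arguments joined by spaces, so its output alone cannot show whether the original arguments were passed separately.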

Hacking up scripts/core_file_merger.py to add those quotes gave me the command line I expected; however, it still hangs at "Find a value for the current rank".

When I ctrl-c the process, the python backtrace tells me it's stuck in info threads:

Traceback (most recent call last):
  File "/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-10.2.0/stat-4.1.0-wn2frxd57sysvqvapa65yd5sqflvi3sr/lib/python3.6/site-packages/STATmain.py", line 134, in <module>
    STATmerge_main(sys.argv[1:])
  File "/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-10.2.0/stat-4.1.0-wn2frxd57sysvqvapa65yd5sqflvi3sr/lib/python3.6/site-packages/core_file_merger.py", line 655, in STATmerge_main
    ret = merger.run()
  File "/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-10.2.0/stat-4.1.0-wn2frxd57sysvqvapa65yd5sqflvi3sr/lib/python3.6/site-packages/stat_merge_base.py", line 314, in run
    trace_object = self.trace_type(filename, self.options)
  File "/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-10.2.0/stat-4.1.0-wn2frxd57sysvqvapa65yd5sqflvi3sr/lib/python3.6/site-packages/stat_merge_base.py", line 49, in __init__
    self.traces = self.get_traces()
  File "/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-10.2.0/stat-4.1.0-wn2frxd57sysvqvapa65yd5sqflvi3sr/lib/python3.6/site-packages/core_file_merger.py", line 535, in get_traces
    core_file.process_core()
  File "/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-10.2.0/stat-4.1.0-wn2frxd57sysvqvapa65yd5sqflvi3sr/lib/python3.6/site-packages/core_file_merger.py", line 428, in process_core
    rank_value = self.get_function_value(gdb, 'MPI_Comm_rank', 1)
  File "/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-10.2.0/stat-4.1.0-wn2frxd57sysvqvapa65yd5sqflvi3sr/lib/python3.6/site-packages/core_file_merger.py", line 216, in get_function_value
    lines = gdb.communicate("info threads")
  File "/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-10.2.0/stat-4.1.0-wn2frxd57sysvqvapa65yd5sqflvi3sr/lib/python3.6/site-packages/core_file_merger.py", line 147, in communicate
    return self.readlines()
  File "/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-10.2.0/stat-4.1.0-wn2frxd57sysvqvapa65yd5sqflvi3sr/lib/python3.6/site-packages/core_file_merger.py", line 128, in readlines
    ch = self.subprocess.stdout.read(1).decode('utf-8')

Any suggestions for next steps?
Thanks

roblatham00 changed the title from "gdb stuck in 'reinitialize source path to empty' prompt" to "stat-core-merger stuck communicating with gdb" on Feb 17, 2022
@roblatham00
Contributor Author

roblatham00 commented Feb 17, 2022

I added additional logging to see what GDB is telling us. It is stuck here, at a line that may or may not be needed on ppc64le (which, as it happens, is the platform I'm on):

https://github.com/LLNL/STAT/blob/develop/scripts/core_file_merger.py#L409

I deleted that extra read but still had hangs with python3.

In the end I fell back to python-2.7 and now it's working (with that extra ppc64 readline deleted)

@lee218llnl
Collaborator

lee218llnl commented Feb 18, 2022

for the gdb hang, you may need to comment out these 3 lines:

if CoreFile.__options['cuda'] != 1:
    lines2 = gdb.readlines()
    lines += lines2

I don't recall the exact history: at some point we found this extra read was necessary, but that no longer appears to be the case.

@lee218llnl
Collaborator

I just committed changes to the develop branch to comment out those lines.

@roblatham00
Contributor Author

A note for me to look one day at doing the gdb communication the other way around: instead of python reading gdb, have gdb execute a python script (https://sourceware.org/gdb/onlinedocs/gdb/Python-API.html)
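A rough sketch of what that inverted arrangement might look like, assuming the script is run with something like `gdb --batch -x collect.py --core=core.2 bedrock`. The file name `collect.py` and the function `collect_backtraces` are hypothetical; `gdb.execute(..., to_string=True)` is the documented way to capture a command's output from gdb's embedded Python. The `gdb` module only exists inside gdb, so the import is guarded.

```python
def collect_backtraces(execute):
    """Run the gdb commands of interest through the supplied callable.

    `execute` takes a gdb command string and returns its output as text.
    Hypothetical helper for illustration; not part of STAT.
    """
    return {
        "threads": execute("info threads"),
        "backtraces": execute("thread apply all bt"),
    }


try:
    import gdb  # only importable inside gdb's embedded interpreter
except ImportError:
    gdb = None

if gdb is not None:
    # Inside gdb: capture command output as strings instead of parsing a pipe.
    result = collect_backtraces(lambda cmd: gdb.execute(cmd, to_string=True))
    print(result["backtraces"])
```

With this approach there is no prompt to wait for, so the script cannot block on a read the way the pipe-based reader does; whether it fits STAT's merging workflow is a separate question.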

@lee218llnl
Collaborator

@roblatham00 Good news, I think I figured out the source of the hang. I was able to reproduce stat-core-merger hangs on one of our CORAL systems and managed to fix it by flushing the input buffer to the gdb process during communicate(). The change is in develop in this commit 19858dc. Also, I was able to install the develop branch with this commit on our CORAL system using the gcc 8.3.1 compiler. Can you try this out and let me know if this resolves your issue?

Note that if you still have your previous STAT installation, you could just modify your installed core_file_merger.py file and add the flush after the stdin.write().
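For anyone following along, here is a small self-contained sketch of the write-then-flush pattern. It is not the code from the commit: a tiny Python child process stands in for gdb, and the names `CHILD`, `communicate`, and `demo` are made up for the example. As far as I know, Python 3's `subprocess.Popen` uses buffered pipes by default while Python 2 defaulted to unbuffered ones, which would be consistent with the Python 2.7 fallback working.

```python
import subprocess
import sys

# Stand-in for gdb (an assumption for this demo): echoes each command it
# receives, then prints a "(gdb)" prompt line, flushing after every reply.
CHILD = (
    "import sys\n"
    "while True:\n"
    "    line = sys.stdin.readline()\n"
    "    if not line:\n"
    "        break\n"
    "    sys.stdout.write('echo: ' + line)\n"
    "    sys.stdout.write('(gdb)\\n')\n"
    "    sys.stdout.flush()\n"
)


def communicate(proc, command):
    """Send one command and read output lines up to the prompt."""
    proc.stdin.write((command + "\n").encode("utf-8"))
    # Without this flush the command can sit in the parent's write buffer,
    # the child never sees it, and the read below blocks forever.
    proc.stdin.flush()
    lines = []
    while True:
        line = proc.stdout.readline().decode("utf-8")
        if not line or line.strip() == "(gdb)":
            break
        lines.append(line.rstrip("\n"))
    return lines


def demo():
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    try:
        return communicate(proc, "info threads")
    finally:
        proc.stdin.close()
        proc.wait()
```

Calling `demo()` sends one command to the stand-in process and returns the echoed reply; removing the `flush()` call is a quick way to reproduce the blocked read in isolation.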
