Pattern error message #130

brunoagneray · 2023-04-06T07:53:50Z

Hi,

We use version lbnl-nhc-1.4.3-1.

We have compute nodes with names like cluster-n[01-99] and storage nodes with names like cluster-nfs[01-99].

We have the following lines in our nhc.conf file:

{cluster-n[01-99]} || export NHC_RM=
{cluster-nfs[01-99]} || export NHC_RM=

When executing the command 'nhc -a' on a storage node in cluster-nfs, we get error messages like the following:

/etc/nhc/scripts/common.nhc: line 201: [[: 10#fs15: value too great for base (error token is "10#fs15")

Regards,

Bruno

mej · 2023-04-20T20:26:23Z

I think what's going on here is that NHC is getting confused by the fact that the leading portion (what the code refers to as the PREFIX) of the hostname cluster-nfs15, when matched against the range expression cluster-n[01-99], is taken to be cluster-n. Once that gets trimmed off, it then tries to treat the remainder of the hostname (i.e., fs15) as a number that it then tries to compare with the range 01-99 to see if the "number" fs15 falls within that range.

Because Bash auto-interprets numbers in bases other than 10 under certain circumstances, the range-matching code prepends 10# to the numeric variables (see https://github.com/mej/nhc/blob/1.4.3/scripts/common.nhc#L201) to ensure they get treated as base-10 numbers in all cases. In this situation, however, fs is getting erroneously lumped into the numeric value, and as the error message says, f and s don't fall within the range of digits that are valid for base-10 numbers.
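To make the mechanism concrete, here is a minimal Bash sketch of prefix-plus-range matching. It is not NHC's actual code: the function name `range_match_sketch` and its arguments are hypothetical, and the digits-only guard is just one way the remainder could be validated before the base-10 comparison.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the code in common.nhc.
# Returns 0 if HOST is PREFIX followed by a number within [LOW, HIGH].
range_match_sketch() {
    local host="$1" prefix="$2" low="$3" high="$4"

    # The hostname must start with the prefix at all.
    [[ "$host" == "$prefix"* ]] || return 1

    # Trim the prefix; for cluster-nfs15 and prefix cluster-n this leaves "fs15".
    local rest="${host#"$prefix"}"

    # Guard (an assumption of this sketch): only compare when the remainder
    # is purely digits. Without it, a test such as [[ 10#fs15 -ge 10#01 ]]
    # produces the "value too great for base" error reported above.
    [[ "$rest" =~ ^[0-9]+$ ]] || return 1

    # 10# forces base 10 so that values like 08 and 09 are not read as octal.
    (( 10#$rest >= 10#$low && 10#$rest <= 10#$high ))
}

range_match_sketch cluster-n15 cluster-n 01 99 && echo "cluster-n15 matches"
range_match_sketch cluster-nfs15 cluster-n 01 99 || echo "cluster-nfs15 does not match"
```

With the guard in place, cluster-nfs15 is simply treated as a non-match for cluster-n[01-99] instead of reaching the arithmetic comparison.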

I'll see if I can reproduce the problem myself by hand, but if you'd be willing to attach the output from running nhc -ax on that cluster-nfs15 host, that'd help a lot! 😀 In the meantime, though, the error shouldn't be causing any actual breakage -- range expression matching should still be working accurately, right?

Thanks for reporting the bug!

PS: As a possible workaround, for the time being, you could change to a glob expression like cluster-n[0-9][0-9] or a regular expression like /^cluster-n[[:digit:]]+$/.
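The two workaround patterns can be checked against sample hostnames with plain Bash pattern matching. This is only a sketch: it uses Bash's own `[[ ... ]]` operators rather than NHC's match logic, the regular expression is written without the surrounding slashes that nhc.conf uses, and the helper names `matches_glob` and `matches_regex` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: test the suggested glob and regex against two sample hostnames.
glob='cluster-n[0-9][0-9]'
regex='^cluster-n[[:digit:]]+$'

# An unquoted right-hand side of == is treated as a glob pattern by [[ ]].
matches_glob()  { [[ "$1" == $glob ]]; }
# =~ applies an extended regular expression.
matches_regex() { [[ "$1" =~ $regex ]]; }

for host in cluster-n15 cluster-nfs15; do
    matches_glob "$host"  && g=yes || g=no
    matches_regex "$host" && r=yes || r=no
    echo "$host glob=$g regex=$r"
done
```

Neither pattern is an exact replacement for the range [01-99]: the glob also accepts cluster-n00, and the regular expression accepts any number of digits.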

brunoagneray · 2023-04-21T07:55:52Z

Hi Michael,

Thanks for your answer. Please find the output of the 'nhc -ax' command on cluster-nfs15.

> PS: As a possible workaround, for the time being, you could change to a glob expression like cluster-n[0-9][0-9] or a regular expression like /^cluster-n[[:digit:]]+$/.

We use the same nhc.conf on all our nodes (heterogeneous nodes, some managed by SLURM and some not), and there are 42 patterns (pdsh-style patterns with {}) to modify. As the errors only appear on our spiro-nfs[01-15] nodes, and the cause of these messages has been identified and has no impact, we will be patient.

Many thanks for your support!

Best regards,

Bruno

Bruno AGNERAY - DSI Service Infrastructure Système et Réseaux / Calcul Scientifique Intensif
Tél: +33 1 46 73 44 10
Mail ***@***.***
ONERA - The French Aerospace Lab - Centre de Châtillon
29, avenue de la Division Leclerc - BP 72 - 92322 CHÂTILLON CEDEX

***@***.*** ~]# nhc -ax

…

***@***.***:342:nhcmain_parse_cmdline()]> dbg 'BASH tracing active.'
***@***.***:99:dbg()]> local PREFIX=
***@***.***:101:dbg()]> [[ '' == \1 ]]
***@***.***:328:nhcmain_parse_cmdline()]> getopts :D:ac:de:fhl:n:qt:vx OPTION
***@***.***:347:nhcmain_parse_cmdline()]> shift 1
***@***.***:348:nhcmain_parse_cmdline()]> [[ ! -z '' ]]
***@***.***:352:nhcmain_parse_cmdline()]> return 0
***@***.***:729:main()]> nhcmain_load_sysconfig
***@***.***:359:nhcmain_load_sysconfig()]> [[ -f /etc/sysconfig/nhc ]]
***@***.***:730:main()]> nhcmain_finalize_env
***@***.***:367:nhcmain_finalize_env()]> CONFDIR=/etc/nhc
***@***.***:368:nhcmain_finalize_env()]> CONFFILE=/etc/nhc/nhc.conf
***@***.***:369:nhcmain_finalize_env()]> INCDIR=/etc/nhc/scripts
***@***.***:370:nhcmain_finalize_env()]> HELPERDIR=/usr/libexec/nhc
***@***.***:371:nhcmain_finalize_env()]> ONLINE_NODE=/usr/libexec/nhc/node-mark-online
***@***.***:372:nhcmain_finalize_env()]> OFFLINE_NODE=/usr/libexec/nhc/node-mark-offline
***@***.***:373:nhcmain_finalize_env()]> LOGFILE='>>/var/log/nhc.log 2>&1'
***@***.***:374:nhcmain_finalize_env()]> RESULTFILE=/var/run/nhc/nhc.status
***@***.***:375:nhcmain_finalize_env()]> DEBUG=0
***@***.***:376:nhcmain_finalize_env()]> TS=0
***@***.***:377:nhcmain_finalize_env()]> SILENT=0
***@***.***:378:nhcmain_finalize_env()]> VERBOSE=0
***@***.***:379:nhcmain_finalize_env()]> MARK_OFFLINE=1
***@***.***:380:nhcmain_finalize_env()]> DETACHED_MODE=0
***@***.***:381:nhcmain_finalize_env()]> DETACHED_MODE_FAIL_NODATA=0
***@***.***:382:nhcmain_finalize_env()]> TIMEOUT=30
***@***.***:383:nhcmain_finalize_env()]> NHC_CHECK_ALL=1
***@***.***:384:nhcmain_finalize_env()]> NHC_CHECK_FORKED=0
***@***.***:385:nhcmain_finalize_env()]> export NHC_SID=0
***@***.***:385:nhcmain_finalize_env()]> NHC_SID=0
***@***.***:388:nhcmain_finalize_env()]> kill -s 0 -- -784937
***@***.***:389:nhcmain_finalize_env()]> [[ 0 -eq 0 ]]
***@***.***:391:nhcmain_finalize_env()]> dbg 'NHC process 784937 is session leader.'
***@***.***:99:dbg()]> local PREFIX=
***@***.***:101:dbg()]> [[ 0 == \1 ]]
***@***.***:392:nhcmain_finalize_env()]> NHC_SID=-784937
***@***.***:405:nhcmain_finalize_env()]> [[ -n '' ]]
***@***.***:410:nhcmain_finalize_env()]> [[ >>/var/log/nhc.log 2>&1 != \>\>\/\v\a\r\/\l\o\g\/\n\h\c\.\l\o\g\ \2\>\&\1 ]]
***@***.***:413:nhcmain_finalize_env()]> [[ >>/var/log/nhc.log 2>&1 == \- ]]
***@***.***:418:nhcmain_finalize_env()]> [[ -z '' ]]
***@***.***:419:nhcmain_finalize_env()]> nhcmain_find_rm
***@***.***:455:nhcmain_find_rm()]> local DIR
***@***.***:456:nhcmain_find_rm()]> local -a DIRLIST
***@***.***:458:nhcmain_find_rm()]> [[ -d /var/spool/torque ]]
***@***.***:461:nhcmain_find_rm()]> [[ -n '' ]]
***@***.***:468:nhcmain_find_rm()]> type -a -p -f -P scontrol
***@***.***:471:nhcmain_find_rm()]> type -a -p -f -P pbsnodes
***@***.***:474:nhcmain_find_rm()]> type -a -p -f -P qselect
***@***.***:477:nhcmain_find_rm()]> type -a -p -f -P badmin
***@***.***:477:nhcmain_find_rm()]> type -a -p -f -P sbatchd
***@***.***:482:nhcmain_find_rm()]> [[ -z '' ]]
***@***.***:483:nhcmain_find_rm()]> dbg 'Unable to detect resource manager.'
***@***.***:99:dbg()]> local PREFIX=
***@***.***:101:dbg()]> [[ 0 == \1 ]]
***@***.***:484:nhcmain_find_rm()]> return 1
***@***.***:420:nhcmain_finalize_env()]> ONLINE_NODE=:
***@***.***:421:nhcmain_finalize_env()]> OFFLINE_NODE=:
***@***.***:422:nhcmain_finalize_env()]> MARK_OFFLINE=0
***@***.***:425:nhcmain_finalize_env()]> [[ '' == \s\g\e ]]
***@***.***:436:nhcmain_finalize_env()]> [[ 0 -ne 0 ]]
***@***.***:443:nhcmain_finalize_env()]> [[ -n '' ]]
***@***.***:445:nhcmain_finalize_env()]> [[ 0 -eq 1 ]]
***@***.***:451:nhcmain_finalize_env()]> export NAME CONFDIR CONFFILE INCDIR HELPERDIR ONLINE_NODE OFFLINE_NODE LOGFILE DEBUG TS SILENT TIMEOUT NHC_RM
***@***.***:731:main()]> [[ -n '' ]]
***@***.***:736:main()]> nhcmain_redirect_output
***@***.***:489:nhcmain_redirect_output()]> [[ -n >>/var/log/nhc.log 2>&1 ]]
***@***.***:490:nhcmain_redirect_output()]> exec
***@***.***:710:nhcmain_finish()]> exit 0

mej self-assigned this Apr 20, 2023

mej added the bug label Apr 20, 2023

mej added this to the 1.4.4 Release milestone Apr 20, 2023
