Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some subtree actions dispatched via dispatcher fails, if dispatch server is run with "-c" option #2828

Open
cenkoloji opened this issue Nov 18, 2024 · 4 comments
Assignees
Labels
bug An unexpected problem or unintended behavior tool/mdsip Relates to one of the MDSplus over IP tools (mdsipd, mdsipsd, mdsip_server)

Comments

@cenkoloji
Copy link
Contributor

cenkoloji commented Nov 18, 2024

Affiliation
EPFL/Swiss Plasma Center

Version(s) Affected/Platform(s)
RHEL9 - mdsplus-alpha 7.139-60
RHEL7 - mdsplus-stable 7.142.81

Installation Method(s)
yum/dnf

Describe the bug

  • We have a main tree, which has few subtrees with actions
  • We run 1 dispatch and several action servers, all with -c 9 (compression) argument.
  • We observed that some actions failed when dispatched from dispatch server to action servers.
  • With help of @joshStillerman I tracked it down to -c 9 argument.
  • For me, issue is ONLY reproducible in presence of multiple subtrees, and dispatch server running with -c option.

To Reproduce

  • On 2 separate terminals, start 2 servers (Change ports if these ones are not available)
mdsip -p 9999 -s -h ./mdsip.hosts  -c 9   # Dispatch server
mdsip -p 9998 -s -h ./mdsip.hosts  -c 9   # Action server
  • Run following python program that will create sub-trees, main tree, set paths everywhere (current shell and action servers), then dispatch actions
#!/usr/bin/env python
# Make a test tree to test dispatcher
# Before running this script, start 2 actions servers on 2 terminals
#  mdsip -p 9999 -s -h ./mdsip.hosts  -c 9   # Dispatch server
#  mdsip -p 9998 -s -h ./mdsip.hosts  -c 9   # Action server

import os
import time
import MDSplus

# Create tree path dir
user = os.environ['USER']
distest_path="/tmp/{}/distest".format(user)
os.system('mkdir -p {}'.format(distest_path))
os.environ['distest_path'] = distest_path
os.environ['dissub1_path'] = distest_path
os.environ['dissub2_path'] = distest_path

# Set environment vars for tree paths in the servers
for port in ["9998","9999"]:
    c = MDSplus.Connection('localhost:' + port)
    for t in ['distest','dissub1','dissub2']:
        c.get('setenv("{}_path=/tmp/{}/distest/")'.format(t,user))
    c.disconnect()

# Create subtrees
for treename in ['dissub1','dissub2']:
    t = MDSplus.Tree(treename,1,'NEW')

    template = 'Build_Action(Build_Dispatch(2, "{}", "INIT", 10, *), {}, *, *, *)'

    # Each action is a simple write expression with tree/action name
    for i in range(1,4):
        nodename = 'act{:03d}'.format(i)
        node = t.addNode(nodename,usage="ACTION")
        server = "localhost:9998"
        function_expr = 'write(*,"Action of {}:{}")'.format(treename,nodename)
        expr = template.format(server, function_expr)
        node.record = t.tdiCompile(expr)

    t.write()
    t.close()

# Create main tree
t = MDSplus.Tree('distest',1,'NEW')
t.addNode('DISSUB1',usage='SUBTREE')
t.addNode('DISSUB2',usage='SUBTREE')
t.write()
t.close()

# Run actions
MDSplus.tcl('dispatch /command /server=localhost:9999 set tree distest /shot=1')
time.sleep(1)
MDSplus.tcl('dispatch /command /server=localhost:9999 dispatch/build')
time.sleep(1)
MDSplus.tcl('dispatch /command /server=localhost:9999 dispatch/phase init')
time.sleep(1) 
MDSplus.tcl('dispatch /command /server=localhost:9999 dispatch/close') 
time.sleep(1) 
MDSplus.tcl('dispatch /command /server=localhost:9999 close/shot=1 distest') 
  • Check logs of dispatch server for errors
  • Repeat the experiment removing -c 9 argument

Log of dispatch server at localhost:9999 (This is from RHEL9, alpha build)

Mon Nov 18 23:26:40 2024 (4) (pid 1106189) Connection received from yildiz@localhost [127.0.0.1]
Connected: yildiz
dispatched client 127.0.0.1
Mon Nov 18 23:26:40 2024 (4) (pid 1106189) Connection disconnected from yildiz@localhost [127.0.0.1]
Mon Nov 18 23:26:40 2024 (4) (pid 1106189) Connection received from yildiz@localhost [127.0.0.1]
Connected: yildiz
dispatched client 127.0.0.1
I, 1106189:1106305, 1731968800.998966397: /opt/jenkins/workspace/MDSplus_alpha/rhel9/servershr/ServerQAction.c:951 setup_client() setup connection 8 SvrJob(op=6, jobid=0, addr=127.0.0.1, port=8800)
Mon Nov 18 23:26:43 2024 (4) (pid 1106189) Connection disconnected from yildiz@localhost [127.0.0.1]
E, 1106189:1106305, 1731968804.784366888: /opt/jenkins/workspace/MDSplus_alpha/rhel9/servershr/ServerQAction.c:860 open_socket() Cannot connect to 127.0.0.1:8800: Connection refused
I, 1106189:1106305, 1731968804.784389394: /opt/jenkins/workspace/MDSplus_alpha/rhel9/servershr/ServerQAction.c:1008 send_reply() SvrJob(op=6, jobid=1, addr=127.0.0.1, port=8800) break connection
Mon Nov 18 23:26:44 2024, Dispatching node .DISSUB1:ACT001 to localhost:9998
W, 1106189:1106305, 1731968804.786785578: /opt/jenkins/workspace/MDSplus_alpha/rhel9/servershr/ServerSendMessage.c:229 ServerSendMessage() no response from server
Mon Nov 18 23:26:44 2024, Action .DISSUB1:ACT001 failed, %MDSPLUS-E-ERROR, Error
Mon Nov 18 23:26:44 2024, Action .DISSUB1:ACT001 failed, %MDSPLUS-E-ERROR, Error
Mon Nov 18 23:26:44 2024, Dispatching node .DISSUB1:ACT002 to localhost:9998
W, 1106189:1106305, 1731968804.787646046: /opt/jenkins/workspace/MDSplus_alpha/rhel9/servershr/ServerSendMessage.c:229 ServerSendMessage() no response from server
Mon Nov 18 23:26:44 2024, Action .DISSUB1:ACT002 failed, %MDSPLUS-E-ERROR, Error
Mon Nov 18 23:26:44 2024, Action .DISSUB1:ACT002 failed, %MDSPLUS-E-ERROR, Error
Mon Nov 18 23:26:44 2024, Dispatching node .DISSUB1:ACT003 to localhost:9998
W, 1106189:1106305, 1731968804.788400405: /opt/jenkins/workspace/MDSplus_alpha/rhel9/servershr/ServerSendMessage.c:229 ServerSendMessage() no response from server
Mon Nov 18 23:26:44 2024, Action .DISSUB1:ACT003 failed, %MDSPLUS-E-ERROR, Error
Mon Nov 18 23:26:44 2024, Action .DISSUB1:ACT003 failed, %MDSPLUS-E-ERROR, Error
Mon Nov 18 23:26:44 2024, Dispatching node .DISSUB2:ACT001 to localhost:9998
Mon Nov 18 23:26:44 2024, localhost:9998 is beginning action .DISSUB2:ACT001
Mon Nov 18 23:26:44 2024, Dispatching node .DISSUB2:ACT002 to localhost:9998
Mon Nov 18 23:26:44 2024, Action .DISSUB2:ACT001 completed
Mon Nov 18 23:26:44 2024, localhost:9998 is beginning action .DISSUB2:ACT002
Mon Nov 18 23:26:44 2024, Dispatching node .DISSUB2:ACT003 to localhost:9998
Mon Nov 18 23:26:44 2024, Action .DISSUB2:ACT002 completed
Mon Nov 18 23:26:44 2024, localhost:9998 is beginning action .DISSUB2:ACT003
Mon Nov 18 23:26:44 2024, Action .DISSUB2:ACT003 completed
I, 1106189:1106305, 1731968804.791971854: /opt/jenkins/workspace/MDSplus_alpha/rhel9/servershr/ServerQAction.c:951 setup_client() setup connection 17 SvrJob(op=6, jobid=2, addr=127.0.0.1, port=8800)
W, 1106189:1106311, 1731968804.792059552: /opt/jenkins/workspace/MDSplus_alpha/rhel9/servershr/ServerSendMessage.c:620 accept_client() Dropped connection from 127.0.0.1:43762

Expected behavior
All actions are successful and dispatched

Additional context
We do not require this bug to be fixed for our site, we don't need to use "-c" option. I'm just reporting for sake of

@cenkoloji cenkoloji added the bug An unexpected problem or unintended behavior label Nov 18, 2024
@mwinkel-dev mwinkel-dev added the tool/mdsip Relates to one of the MDSplus over IP tools (mdsipd, mdsipsd, mdsip_server) label Nov 20, 2024
@joshStillerman
Copy link
Contributor

I can reproduce this on the current stable without the -c on either of the servers.
The test script works for the first time that you run it, and chokes tthe second time. The dispatcher segflaults when trying to look at the treeheader of the parent tree the 2nd time.

@cenkoloji
Copy link
Contributor Author

That's due to missing dispatch/close. Adding dispatch/close and close statement, you can run it many times.

I update ticket description with correct clean up actions.

@joshStillerman
Copy link
Contributor

Yes, thanks - however I think this is a bug in TreeOpen.

@joshStillerman
Copy link
Contributor

If I run your script twice the dispatcher (port 9999) does in fact segfault. However it segfaults regardless of the compression flag. What is happening is that the python program that drives the whole thing, needs to:

  1. tell the dispatcher to execute the command dispatch/close
  2. tell the dispatcher to execute the command close

Otherwise the edited tree shot 1 is still on its stack, and TreeOpen is quite confused.

Here is the new test program (I added a second shot, which had no impact)

#!/usr/bin/env python
# Make a test tree to test dispatcher
# Before running this script, start 2 actions servers on 2 terminals
#  mdsip -p 9999 -s -h ./mdsip.hosts  -c 9   # Dispatch server
#  mdsip -p 9998 -s -h ./mdsip.hosts  -c 9   # Action server

import os
import time
import MDSplus

# Create tree path dir
user = os.environ['USER']
distest_path="/tmp/{}/distest".format(user)
os.system('mkdir -p {}'.format(distest_path))
os.environ['distest_path'] = distest_path
os.environ['dissub1_path'] = distest_path
os.environ['dissub2_path'] = distest_path

# Set environment vars for tree paths in the servers
for port in ["9998","9999"]:
    c = MDSplus.Connection('localhost:' + port)
    for t in ['distest','dissub1','dissub2']:
        c.get('setenv("{}_path=/tmp/{}/distest/")'.format(t,user))
    c.disconnect()

# Create subtrees
for treename in ['dissub1','dissub2']:
    t = MDSplus.Tree(treename,-1,'NEW')

    template = 'Build_Action(Build_Dispatch(2, "{}", "INIT", 10, *), {}, *, *, *)'

    # Each action is a simple write expression with tree/action name
    for i in range(1,4):
        nodename = 'act{:03d}'.format(i)
        node = t.addNode(nodename,usage="ACTION")
        server = "localhost:9998"
        function_expr = 'write(*,"Action of {}:{}")'.format(treename,nodename)
        expr = template.format(server, function_expr)
        node.record = t.tdiCompile(expr)

    t.write()
    t.close()

# Create main tree
t = MDSplus.Tree('distest',-1,'NEW')
t.addNode('DISSUB1',usage='SUBTREE')
t.addNode('DISSUB2',usage='SUBTREE')
t.write()
t.close()

# Run actions
MDSplus.tcl('set tree distest/shot=-1')
time.sleep(1)
MDSplus.tcl('create pulse 1')
MDSplus.tcl('dispatch /command /server=localhost:9999 set tree distest /shot=1')
time.sleep(1)
MDSplus.tcl('dispatch /command /server=localhost:9999 dispatch/build')
time.sleep(1)
MDSplus.tcl('dispatch /command /server=localhost:9999 dispatch/phase init')
time.sleep(1)
MDSplus.tcl('dispatch /command /server=localhost:9999 dispatch/close')
MDSplus.tcl('dispatch /command /server=localhost:9999 close')

print('done with shot 1')
MDSplus.tcl('create pulse 2')
time.sleep(1)
print('open shot 2')
MDSplus.tcl('dispatch /command /server=localhost:9999 set tree distest /shot=2')
time.sleep(1)
print('dispatch build')
MDSplus.tcl('dispatch /command /server=localhost:9999 dispatch/build')
time.sleep(1)
print('dispatch the phase')
MDSplus.tcl('dispatch /command /server=localhost:9999 dispatch/phase init')
MDSplus.tcl('dispatch /command /server=localhost:9999 dispatch/close')
MDSplus.tcl('dispatch /command /server=localhost:9999 close')

I assume that what promted this bug report originally was not a scenario where the pulse file was being editted between shots.

Also, I do not know if we can detect that this is happening to us. One possibility is to have a way to signal processes that the tree has been editted and needs to be reopened - kind of like the 'close_all_trees' event

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug An unexpected problem or unintended behavior tool/mdsip Relates to one of the MDSplus over IP tools (mdsipd, mdsipsd, mdsip_server)
Projects
None yet
Development

No branches or pull requests

3 participants