Replies: 6 comments
-
Thanks @spophale for putting this together! My main feedback is that I think it would be good to explicitly call out the various debugging sub use-cases and then organize the main use-case based on that. For example (not sure if this is the correct list of sub use-cases):
Maybe @rhc54 can comment on this list: whether these are relevant sub use-cases, and whether there are others he is aware of that we missed. For the main use-case and each sub use-case, I think it would also be beneficial to explicitly list the exact interfaces, attributes, etc. that are used and whether they are required or optional for the sub use-case. If that information isn't spelled out in our RFC notes, then we should make a note of that and bring some extra people onto the call to fill in those gaps.
-
There are some possibly relevant write-ups on the web site: IO forwarding for tools. These may well be a little outdated as quite a bit of work was done this year, but they at least walk you through the basics. The only thing I see missing from @SteVwonder's list is the "indirect-launch" case, where the debugger spawns an intermediate launcher (e.g., "mpirun") that in turn launches the actual application. This case depends heavily on the event notification system for coordinating the launch, so you might find the write-up on it useful.
-
Updated the original post with a snapshot of the use-case from our Google Drive drafts folder: https://drive.google.com/open?id=1eN7aBxyzPD0a_GJFq1KH2ZHpoONj76op
-
Thanks @spophale for the recent edits. I agree that the use-case flows quite nicely now with the "Tool Interaction" section after the launching sections. One thing I noticed on my read through is that we are missing the Interfaces
Attributes/Directives
It looks like some of those same attributes can go in 1 and 2 too.
-
While working on the coverage analysis, I found that a few of the interfaces mentioned didn't match the definitions in the standard. Suggested changes below:
diff --git a/src/coverage-data/use-cases/debugging.md b/src/coverage-data/use-cases/debugging.md
index dbfa097..6bb1203 100644
--- a/src/coverage-data/use-cases/debugging.md
+++ b/src/coverage-data/use-cases/debugging.md
@@ -77,7 +77,7 @@ PMIx_tool_init
PMIx_Register_event_handler
PMIx_Spawn
PMIx_Notify_event
-PMIx_tool_connect_server
+PMIx_tool_connect_to_server
PMIx_Query_info
PMIx_Get
@@ -144,8 +144,8 @@ Tools can benefit from a mechanism by which they may interact with a local PMIx
#### Interfaces
-PMIx_Query_nb
-PMIx_Server_init
+PMIx_Query_info_nb
+PMIx_server_init
PMIx_Register_event_handler
PMIx_Deregister_event_handler
PMIx_Notify_event
-
The v5.0.x PR #328 included this use case. The issue will remain open for further discussion on this topic.
-
Brief Description
This use case is an attempt to distill out the features/extensions requested in the RFCs that are related to debugging. We have identified parts of PR23 (Co-located process launch for debuggers), RFC0010 (MPIR-like query), RFC0002 (event pub/sub), and RFC0022 (Environmental Parameter Directives for Applications and Launchers) under this category.
Terminology
Tools vs Debuggers
ptrace
Parallel Launching Methods
A starter program is a program responsible for launching a parallel runtime, such as MPI. PMIx supports two primary methods for launching parallel applications under tools and debuggers: indirect and direct. In the indirect launching method, the tool is attached to the starter. In the direct launching method, the tool takes the place of the starter. PMIx also supports attaching to already running programs via the Process Acquisition interfaces.
Process Synchronization
Process Synchronization is the technique tools use to start the processes of a parallel application such that the tools can still attach to each process early in its lifetime. Said another way, the tool must be able to start the application processes without them "running away" from the tool. In the case of MPI, this means stopping the application's processes before they return from MPI_Init.
Process Acquisition
Process Acquisition is the technique tools use to locate all of the processes, local and remote, of a given parallel application. This typically boils down to collecting for every process in the parallel application:
Use Case Details
1. Direct-Launch Debugger Tool
PMIx can support the tool itself using the PMIx spawn options to control the app’s startup, including directing the RM/application as to when to block and wait for tool attachment, or stipulating that an interceptor library be preloaded. However, this means that the user is restricted to whatever command line options the tool vendor has provided for operations such as process placement and binding, which places a significant burden on the tool vendor. An example might look like the following:
dbgr -n 3 ./myapp
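As a rough illustration of how a tool might turn such a command line into a spawn request carrying startup directives, here is a minimal Python sketch. It is a toy model, not a PMIx binding: the function name `build_spawn_request`, the dictionary layout, and the directive keys `stop-in-init` and `preload-library` are all hypothetical names chosen for this example.

```python
import shlex


def build_spawn_request(command_line, stop_in_init=True, preload=None):
    """Translate a hypothetical `dbgr -n N app [args...]` line into a spawn request."""
    tokens = shlex.split(command_line)
    if not tokens or tokens[0] != "dbgr":
        raise ValueError("expected a command line starting with 'dbgr'")

    # The only tool option modeled here is the process count.
    nprocs, rest = 1, tokens[1:]
    if len(rest) >= 2 and rest[0] == "-n":
        nprocs, rest = int(rest[1]), rest[2:]
    if not rest:
        raise ValueError("no application given")

    # Directives the tool attaches to control application startup
    # (key names are assumptions made for this sketch).
    directives = {}
    if stop_in_init:
        directives["stop-in-init"] = True          # block and wait for tool attachment
    if preload is not None:
        directives["preload-library"] = preload    # interceptor library to preload

    return {"app": rest[0], "args": rest[1:], "nprocs": nprocs,
            "directives": directives}


request = build_spawn_request("dbgr -n 3 ./myapp", preload="libintercept.so")
print(request)
```

The sketch also shows the limitation noted above: any placement or binding option the user wants has to be parsed and forwarded by the tool itself.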
Assuming it is supported, co-launch of debugger daemons in this use-case is achieved by adding a pmix_app_t to the PMIx_Spawn command, indicating that the resulting processes are debugger daemons by setting the PMIX_DEBUGGER_DAEMONS attribute.
Interfaces
Attributes/Directives
2. Indirect-Launch Debugger Tool
Executing a program under a tool using an intermediate launcher such as mpiexec can also be made possible. This requires some degree of coordination between the tool and the launcher. Ultimately, it is the launcher that is going to launch the application, and the tool must somehow inform it (and the application) that this is being done in a debug session so that the application knows to “block” until the tool attaches to it.
In this operational mode, the user invokes a tool (typically on a non-compute, or “head”, node) that in turn uses mpiexec to launch their application – a typical command line might look like the following:
dbgr -dbgoption mpiexec -n 32 ./myapp
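The coordination described above can be sketched with a small in-process model. This is only an assumption-laden illustration in Python using threads: the `EventBus` class, the event names `job-ready-for-debug` and `release`, and the function names are hypothetical and do not correspond to PMIx interfaces. It shows the ordering that matters: the launcher starts the application, the application blocks, the tool is notified and attaches, and only then are the processes released.

```python
import threading


class EventBus:
    """Toy in-process stand-in for an event notification system."""

    def __init__(self):
        self._handlers = {}
        self._lock = threading.Lock()

    def register(self, event, handler):
        with self._lock:
            self._handlers.setdefault(event, []).append(handler)

    def notify(self, event, payload=None):
        # Copy the handler list under the lock, then call handlers outside it
        # so that a handler may itself call notify().
        with self._lock:
            handlers = list(self._handlers.get(event, []))
        for handler in handlers:
            handler(payload)


def app_process(rank, release, log):
    # The application "blocks" here until the tool lets it go.
    release.wait()
    log.append(("running", rank))


def launcher(bus, nprocs, release, log):
    # The launcher, not the tool, starts the application processes.
    procs = [threading.Thread(target=app_process, args=(r, release, log))
             for r in range(nprocs)]
    for p in procs:
        p.start()
    # Tell the tool that the job is up and waiting (hypothetical event name).
    bus.notify("job-ready-for-debug", {"ranks": list(range(nprocs))})
    for p in procs:
        p.join()


def run_debug_session(nprocs=4):
    bus, release, log = EventBus(), threading.Event(), []

    def on_ready(payload):
        # The tool attaches first, then asks for the processes to be released.
        log.append(("attached", tuple(payload["ranks"])))
        bus.notify("release")

    bus.register("job-ready-for-debug", on_ready)
    bus.register("release", lambda _payload: release.set())
    launcher(bus, nprocs, release, log)
    return log


session_log = run_debug_session(3)
print(session_log)
```

Because the release event is only raised from inside the tool's handler, no application process can record itself as running before the tool has attached.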
Interfaces
Attributes/Directives
3. Attaching to a Running Job
PMIx supports attaching to an already running parallel job in two ways. In the first way, the main process of a tool calls PMIx_Query_info with the PMIX_QUERY_PROC_TABLE attribute. This returns an array of structs containing the information required for process acquisition. This includes remote hostnames, executable names, and process IDs. In the second way, every tool daemon calls PMIx_Query_info with the PMIX_QUERY_LOCAL_PROC_TABLE attribute. This returns a similar array of structs but only for processes on the same node.
An example of this use-case may look like the following:
Interfaces
Attributes/Directives
4. Tool Interaction with RM
Tools can benefit from a mechanism for interacting with a local PMIx server that has opted to accept such connections, along with support for tool connections to system-level PMIx servers and a logging feature. To support tool connections to a specified system-level PMIx server, environments could choose to launch a set of PMIx servers to support a given allocation. These servers will (if so instructed) provide a tool rendezvous point that is tagged with their pid and typically placed in an allocation-specific temporary directory to allow for possible multi-tenancy scenarios. Supporting such operations requires that a system-level PMIx connection be provided which is not associated with a specific user or allocation. A new key has been added to direct the PMIx server to expose a rendezvous point specifically for this purpose.
Interfaces
Job-specific events
PMIX_EVENT_JOB_LEVEL /* debugger attached, process failure */
Environment events
PMIX_EVENT_ENVIRO_LEVEL /* ECC errors, temperature excursions */
Errors detected by clients/peers
Network fabric manager detects data corruption
5. Environmental Parameter Directives for Applications and Launchers
It is sometimes desirable or required that standard environmental variables (e.g., PATH, LD_LIBRARY_PATH, LD_PRELOAD) be modified prior to executing an application binary or a starter such as mpiexec. This is particularly true when tools/debuggers are used to start the application. This RFC proposes the definition of a new PMIx structure (pmix_envar_t) and associated attributes for specifying such operations.
Interfaces
Attributes/Directives
Resource managers and launchers must scan for relevant directives, modifying environmental parameters as directed. Directives are to be processed in the order in which they were given, starting with job-level directives (applied to each app) followed by app-level directives.
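The ordering rule above can be sketched as follows. This is a minimal Python model under stated assumptions: `EnvDirective` is a hypothetical stand-in rather than the pmix_envar_t structure itself, and the operation names `set`, `unset`, `prepend`, and `append` are illustrative choices for this example.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvDirective:
    """Hypothetical stand-in for one environment directive (not pmix_envar_t itself)."""
    op: str                # "set", "unset", "prepend", or "append" (illustrative names)
    name: str              # environment variable to modify
    value: str = ""        # ignored for "unset"
    sep: str = os.pathsep  # separator used when joining path-like values


def apply_directive(env, d):
    """Apply a single directive to env in place."""
    current = env.get(d.name)
    if d.op == "set":
        env[d.name] = d.value
    elif d.op == "unset":
        env.pop(d.name, None)
    elif d.op == "prepend":
        env[d.name] = d.value if not current else d.value + d.sep + current
    elif d.op == "append":
        env[d.name] = d.value if not current else current + d.sep + d.value
    else:
        raise ValueError(f"unknown directive op: {d.op!r}")


def build_app_env(base_env, job_directives, app_directives):
    """Job-level directives first, then app-level, each in the order given."""
    env = dict(base_env)  # never mutate the caller's environment
    for d in list(job_directives) + list(app_directives):
        apply_directive(env, d)
    return env


base = {"PATH": "/usr/bin", "LD_PRELOAD": "libold.so"}
job = [EnvDirective("prepend", "PATH", "/opt/dbgr/bin", ":"),
       EnvDirective("unset", "LD_PRELOAD")]
app = [EnvDirective("set", "LD_PRELOAD", "libintercept.so")]
result = build_app_env(base, job, app)
print(result)
```

Because app-level directives run last, an application can override a job-wide setting, as the LD_PRELOAD entry does here.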
References