This is a step-by-step guide to creating a fully automated experiment with the use of Globus tools via the Gladier automation library. Full documentation can be found at Gladier Read the Docs; see also Globus-Compute for information about the remote execution mechanisms used in our automation.
Hello traveler. Welcome to this guide to the Globus Architecture for Data-Intensive Experimental Research, in short, GLADIER: a library designed to facilitate the authoring of flows, which are sequences of actions (e.g., running a program, transferring data, publishing data) associated with the processing of data from experimental facilities.
This step-by-step guide to the use of Gladier and associated tools will prepare you to author and run multi-step flows.
A *flow* is a sequence of actions that are to be performed. This sequence is defined by a *recipe* (also known as a flow definition). A flow definition contains all information necessary to execute each step of the flow, including:
* What: The actions to be executed
* Where: Where each action is to be executed
* When: The order in which the actions are executed
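To make this concrete, below is a minimal sketch of what a generated flow definition for a single compute step might look like. This is illustrative only: Gladier generates the real definition for you, and the state and variable names here are assumptions.

```python
# Illustrative single-step flow definition (normally generated by Gladier).
flow_definition = {
    'Comment': 'Run one compute function',
    'StartAt': 'SimpleTool',  # When: the first (and only) state
    'States': {
        'SimpleTool': {
            'Type': 'Action',  # What: invoke the Globus Compute action provider
            'ActionUrl': 'https://compute.actions.globus.org',
            'Parameters': {
                'tasks': [{
                    'endpoint.$': '$.input.compute_endpoint',  # Where: which machine
                    'function.$': '$.input.simple_compute_function_id',
                    'payload.$': '$.input',
                }]
            },
            'ResultPath': '$.SimpleTool',
            'End': True,
        }
    },
}
```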
A client is the code that defines all necessary information to run a flow. To simplify understanding, we work here with three example flows that each comprise a single step representing three types of operation:
- The Process flow, at `simple_clients/example_client_process.py`, runs a program at a specified location
- The Transfer flow, at `simple_clients/example_client_transfer.py`, transfers data from one location to another
- The Publish flow, at `simple_clients/example_client_publish.py`, loads data and metadata to a catalog
For full automation of an experiment, some infrastructure setup cannot be avoided. This is usually done once, after which you can focus on tool development.
We advise creating a new Python environment to run this tutorial. For simplicity, we use miniconda:
conda create -n gladier-test python pip
conda activate gladier-test
pip install gladier
Our first example, Process (`simple_clients/example_client_process.py`), defines a single-step flow that executes a Python function at a specified remote location.
./simple_clients/example_client_process.py
In order to execute a remote function (the equivalent of a *lambda*), we use the Globus Compute service to register, retrieve, and execute functions. The service requires that a small client be deployed on the "processing" machine. We have a client running on one of our example machines; its location is defined by:
'compute_endpoint': '4b116d3c-1703-4f8f-9f6f-39921e5864df'
Running the Globus Compute client on a computer gives a flow the ability to send functions to be executed on that computer. (This computer may be a remote cluster or cloud, or alternatively the computer on which the flow client executes: Globus Compute runs the same in each case.) As we explore later, different parts of the flow can be executed on different computers by providing a different `compute_endpoint` value for each part.
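If you want to sanity-check an endpoint outside of any flow, the Globus Compute SDK can submit a function to it directly. A minimal sketch, assuming `pip install globus-compute-sdk` and using the tutorial endpoint above:

```python
from globus_compute_sdk import Executor

def hello(name):
    return f'Hello, {name}!'

# Submit the function to the tutorial endpoint and wait for the result.
with Executor(endpoint_id='4b116d3c-1703-4f8f-9f6f-39921e5864df') as ex:
    future = ex.submit(hello, 'traveler')
    print(future.result())  # blocks until the remote function returns
```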
The Process flow runs the function defined by `tools/simple_compute_tool.py` and imported into the client as `SimpleTool`. Gladier automatically checks for updates to the function definition and registers (or re-registers, if the local function definition has changed) with the Globus Compute service. The UUID of this function is automatically populated in the flow definition.
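For reference, such a tool is typically a small `GladierBaseTool` subclass. The sketch below shows what `tools/simple_compute_tool.py` might contain; the actual file in the repository is authoritative, and depending on your Gladier version the function list attribute may be `compute_functions` (newer) or `funcx_functions` (older).

```python
from gladier import GladierBaseTool, generate_flow_definition

def simple_compute_function(**data):
    # Runs remotely on the Globus Compute endpoint.
    with open(data['wfile'], 'w') as f:
        f.write(f"Hello {data['name']}!")
    return data['wfile']

@generate_flow_definition
class SimpleTool(GladierBaseTool):
    compute_functions = [simple_compute_function]
    required_input = ['name', 'wfile', 'compute_endpoint']
```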
Before execution, the `input` variables need to be defined to contain all information necessary for the flow. In our case, that information includes the two input parameters expected by `SimpleTool` (`name` and `wfile`) and the location of the computer where the function is to execute, `compute_endpoint`.
flow_input = {
    'input': {
        'name': args.name,
        'wfile': '/test/test.txt',
        # Globus Compute tutorial endpoint
        'compute_endpoint': '4b116d3c-1703-4f8f-9f6f-39921e5864df',
    }
}
At the next step in the execution, Gladier looks for changes in the tool definition and tool sequence, then takes care of creating the flow definition and (re-)registering it with the Globus Flows service. Once a flow is created, it can be accessed in the Globus web app via a link like the one below.
Flow created with ID: ddb9d6be-d48f-40df-a663-6bcc6db1bb76 https://app.globus.org/flows/ddb9d6be-d48f-40df-a663-6bcc6db1bb76
Once a flow is `.run()`, a unique ID is generated for that particular combination of flow definition and payload. The run log can also be accessed in the web app:
Run started with ID: 6fa0969a-2778-4f7f-95d5-d365e89aca32 https://app.globus.org/runs/6fa0969a-2778-4f7f-95d5-d365e89aca32
Running the client again will not register a new flow with the Globus service, but it will generate a new run instance.
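Programmatically, starting and following a run might look like this (a sketch: `ExampleProcessClient` stands in for the client class defined in the example file, and the run ID key may differ between Gladier versions):

```python
client = ExampleProcessClient()               # hypothetical client class
run = client.run_flow(flow_input=flow_input)  # registers (if needed) and starts a run
run_id = run['run_id']                        # may be run['action_id'] in older Gladier
client.progress(run_id)                       # poll and print progress until the run finishes
print(client.get_status(run_id))              # final status details
```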
- We suggest keeping the client definition outside of the `__main__` block of the Python file, then creating an instance inside `__main__` and calling `.run()`. A sketch of this pattern follows the list.
- Each tool in the flow has its own `required_input` variables that need to be included in the initial payload.
- The `SimpleTool` and its driving function `simple_compute_function` are separated into a `tools` folder in a single file. We advise creating one Python file per "action" in the flow; this makes development, debugging, and tracing errors much simpler.
- The `example_client.py` itself is also separated from the other clients in the folder and contains only one `GladierBaseClient`. This prevents creating instances with the 'wrong' flow definition and avoids common confusion about 'what is running'.
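Putting these suggestions together, a client file might be structured like this (a sketch; module paths and names are illustrative):

```python
from gladier import GladierBaseClient, generate_flow_definition

@generate_flow_definition
class ExampleProcessClient(GladierBaseClient):
    # One tool per "action"; each lives in its own file under tools/
    gladier_tools = ['tools.simple_compute_tool.SimpleTool']

if __name__ == '__main__':
    flow_input = {
        'input': {
            'name': 'traveler',
            'wfile': '/test/test.txt',
            'compute_endpoint': '4b116d3c-1703-4f8f-9f6f-39921e5864df',
        }
    }
    client = ExampleProcessClient()
    client.run_flow(flow_input=flow_input)
```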
Our second example can be found at `simple_clients/example_client_transfer.py`. It creates and executes another 1-step flow: it creates a file on the local filesystem of your workstation and sends it to our remote server. In order to allow for the transfer, the first step is to register your workstation as a Globus endpoint.
Before executing this file, please create a Globus endpoint on your machine by following the instructions at:
https://www.globus.org/globus-connect-personal
After the installation you will receive a UUID for your machine (which is now a *storage endpoint*); update line 38 with your own UUID.
local_endpoint_id = 'cde22510-5de7-11ec-9b5c-f9dfb1abb183'
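If Globus Connect Personal is installed on your machine, the Globus SDK can look up the endpoint UUID for you instead of copying it by hand. A sketch, assuming `pip install globus-sdk`:

```python
from globus_sdk import LocalGlobusConnectPersonal

# Reads the endpoint ID from the local Globus Connect Personal installation.
local_endpoint_id = LocalGlobusConnectPersonal().endpoint_id
print(local_endpoint_id)
```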
Now go ahead and execute the code.
./simple_clients/example_client_transfer.py
Notice how in this case the `input` is different. It now defines `simple_transfer_source_endpoint_id`, `simple_transfer_source_path`, `simple_transfer_destination_endpoint_id`, and `simple_transfer_destination_path`.
flow_input = {
    'input': {
        # local server information
        'simple_transfer_source_endpoint_id': local_endpoint_id,
        'simple_transfer_source_path': os.path.expanduser(args.dir),
        # remote server information
        'simple_transfer_destination_endpoint_id': 'ef4203ca-6510-466c-9bff-a5d2cc316673',
        'simple_transfer_destination_path': '/demo/animals/',
    }
}
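Under the hood, this single step is a Globus Transfer action. The sketch below shows roughly what the generated state looks like; it is illustrative only, since Gladier generates the real definition from the tool:

```python
# Illustrative Transfer state (normally generated by Gladier).
transfer_state = {
    'Type': 'Action',
    'ActionUrl': 'https://actions.globus.org/transfer/transfer',
    'Parameters': {
        'source_endpoint_id.$': '$.input.simple_transfer_source_endpoint_id',
        'destination_endpoint_id.$': '$.input.simple_transfer_destination_endpoint_id',
        'transfer_items': [{
            'source_path.$': '$.input.simple_transfer_source_path',
            'destination_path.$': '$.input.simple_transfer_destination_path',
            'recursive': False,
        }],
    },
    'ResultPath': '$.SimpleTransfer',
    'End': True,
}
```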
You can see the transferred files in the Globus web app; the code will also print a direct link showing both your endpoint and the remote one.
Our third 1-step example can be found at `simple_clients/example_client_publish.py`. It performs a simple operation: publishing some data to a Globus Search index. As before, some setup is required: creating a Globus Search index and learning how to visualize it.
pip install globus-search-cli
globus-search login
globus-search index create example-index gladier-example-index
The result is a new search index in the Globus Search service, which will serve as a "repository" for the flow data.
{ "@datatype": "GSearchIndex",
"@version": "2017-09-01",
"creation_date": "2022-04-27 21:04:30",
"description": "gladier-example-index",
"display_name": "example-index",
"id": "563c3d98-6fa8-4ef5-83e2-0f378efe0a5f",
...
}
The search index ID `563c3d98-6fa8-4ef5-83e2-0f378efe0a5f` will be used so the flow knows where to send metadata.
To execute our simple publish client:
./simple_clients/example_client_publish.py
To check that the data made it to the index, visit https://acdc.alcf.anl.gov/globus-tutorial/563c3d98-6fa8-4ef5-83e2-0f378efe0a5f
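You can also query the index programmatically. A sketch using the Globus SDK's `SearchClient` (assumes `pip install globus-sdk` and a publicly searchable index; substitute your own index UUID):

```python
import globus_sdk

# Unauthenticated client; sufficient for searching public indices.
sc = globus_sdk.SearchClient()
results = sc.search('563c3d98-6fa8-4ef5-83e2-0f378efe0a5f', q='*')
for entry in results['gmeta']:
    print(entry['subject'])
```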
At this point, you have the tools to create 1-step flows. We suggest you play with the examples and create your own Python functions. Some ideas:
- Create your own Globus Compute endpoint and try executing functions on your own machine. One simple way to see where a function is being executed is to have it create a file on your desktop.
- Now try transferring this file to our remote server.
- Then retrieve the information about this file and add it to your search index.
- Try downloading the file from the Globus web app.
Most flows will be a combination of the steps above. Our next step is to merge all of them into a single flow.
For simplicity, we start a new client in `/full_client` with a new set of `/tools`. This time, we will focus on how to execute it all at once.
The idea is to take the infrastructure from the simple examples and expand it into a flow with the following steps (a sketch of such a client follows the list):
- Transfer
- Compute
- Compute
- Publish
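In Gladier, chaining these steps amounts to listing the tools in order on a single client. A sketch (the tool import paths are illustrative; the real ones live in `/full_client` and `/tools`):

```python
from gladier import GladierBaseClient, generate_flow_definition

@generate_flow_definition
class FullClient(GladierBaseClient):
    # Tools run in list order: Transfer -> Compute -> Compute -> Publish
    gladier_tools = [
        'tools.transfer_tool.TransferTool',
        'tools.compute_a.ComputeA',
        'tools.compute_b.ComputeB',
        'gladier_tools.publish.Publish',  # from the Gladier-Tools package
    ]
```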
As you noticed above, each 1-step flow has a different payload. Now, all the `input` variables need to be defined at once.
Best practice: the name of each variable in the payload should match the input names of the variables in the Globus Compute functions. Keeping them separate and well commented in the client makes debugging much faster for you and others.
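For example, a combined payload for the full flow might group and comment each step's variables like this (a sketch; the variable names, paths, and helper variables are illustrative):

```python
flow_input = {
    'input': {
        # --- Transfer step ---
        'transfer_source_endpoint_id': local_endpoint_id,
        'transfer_source_path': '/local/data/run1/',
        'transfer_destination_endpoint_id': remote_endpoint_id,
        'transfer_destination_path': '/remote/data/run1/',
        # --- Compute steps ---
        'compute_endpoint': compute_endpoint_id,
        'name': 'run1',
        'wfile': '/remote/data/run1/test.txt',
        # --- Publish step ---
        'search_index': search_index_id,
    }
}
```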
Auto-creating the flow definitions is great, but sometimes you may want to change variables such as timeouts, or which endpoint will execute each function.
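One way to do this is to fetch the generated definition, adjust it, and inspect it before running. The sketch below assumes your client exposes the definition via `get_flow_definition()` and that a state named `ComputeA` exists; check `flow_def['States']` for the real state names:

```python
client = FullClient()
flow_def = client.get_flow_definition()
# Raise the maximum wait time for a long-running compute step.
flow_def['States']['ComputeA']['WaitTime'] = 3600
```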
Best practice: for machines that require queueing, we can define two endpoints: `compute_endpoint` as the main endpoint and a second `compute_nonqueue_endpoint` for lighter operations. Note that since they live on the same machine, they share the same `globus_endpoint` and can therefore access the same files. You can use the `nonqueue` endpoint for compute functions that do not require much compute power.
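In the payload, this pattern simply means carrying two endpoint UUIDs (a sketch; the UUIDs are placeholders):

```python
flow_input = {
    'input': {
        # Heavy functions go through the batch queue
        'compute_endpoint': '00000000-0000-0000-0000-000000000001',
        # Light bookkeeping functions skip the queue (same machine, same filesystem)
        'compute_nonqueue_endpoint': '00000000-0000-0000-0000-000000000002',
    }
}
```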
Functions may already have been created by you or others. You can import a `GladierBaseTool` from a different package and include it directly in your flow. Gladier will take care of registering the Globus Compute definition under your username.
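For example, reusing the Transfer tool from the Gladier-Tools package might look like the sketch below; check the Gladier-Tools documentation for the exact import path and required inputs:

```python
from gladier import GladierBaseClient, generate_flow_definition
from gladier_tools.globus import Transfer  # a tool authored elsewhere

@generate_flow_definition
class MyClient(GladierBaseClient):
    gladier_tools = [Transfer, 'tools.my_tool.MyTool']
```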
Tip: Gladier-Tools already provides many tools for POSIX, transfer, and publish operations. In particular, `Publish` is very useful for automatically creating and ingesting metadata based on your experiment. Let's explore it further below.
Our flow is now defined in an executable file and can be integrated with any CLI-based execution method, since our single-instance client defines a `run_flow` function that needs only a defined set of variables to run.
We can create a FileWatcher that builds this payload whenever a new file is created or modified. The watcher itself does not carry any information about the flow; a minimal sketch is shown below.
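A minimal sketch of such a watcher, using the third-party `watchdog` package; `run_flow_for()` is a hypothetical helper that builds the payload and calls the client's `run_flow`, and the watched directory is illustrative:

```python
import time
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

def run_flow_for(path):
    # Hypothetical helper: build flow_input from `path` and call
    # client.run_flow(flow_input=...) on your Gladier client.
    print(f'Would start a flow for {path}')

class NewFileHandler(FileSystemEventHandler):
    def on_created(self, event):
        if not event.is_directory:
            run_flow_for(event.src_path)

observer = Observer()
observer.schedule(NewFileHandler(), '/data/incoming', recursive=True)
observer.start()
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()
```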
Let's explore what was covered in the example above.