Note that you should build the images first via build. We provide a simple, reproducible build for Ubuntu 24.04 with Singularity, Flux, ORAS, and updated drivers. We also provide that same logic in Docker containers built alongside this repository, from the docker directory.
You'll first need to export the full image identifier to the environment:
export TF_VAR_vm_image_storage_reference=/subscriptions/3e173a37-8f81-492f-a234-ca727b72e6f8/resourceGroups/packer-testing/providers/Microsoft.Compute/images/flux-framework-2404
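A typo in the image identifier tends to surface late, in the middle of a terraform run. As a minimal sketch (the function name `check_image_id` is hypothetical and not part of this repository), you could do a loose shape check on the exported value first. It only checks the path layout of the resource ID, not whether the image actually exists:

```shell
# Hypothetical helper: loosely check that a value looks like an Azure
# managed image resource ID. This is a shape check only; it does not
# contact Azure or confirm that the image exists.
check_image_id() {
    case "$1" in
        /subscriptions/*/resourceGroups/*/providers/Microsoft.Compute/images/?*)
            # Print the last path component (the image name)
            echo "ok: ${1##*/}"
            ;;
        *)
            echo "unexpected image ID: '$1'" >&2
            return 1
            ;;
    esac
}
```

You would call it as `check_image_id "${TF_VAR_vm_image_storage_reference}"` before running make.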
Note that I needed to clone this repository and run the commands from the Cloud Shell in the Azure portal.
git clone https://github.com/converged-computing/flux-tutorials
cd flux-tutorials/tutorial/azure
Check the start-script.sh and variables at the top of main.tf (e.g., customize the size and other parameters) and then:
make
The shell can be buggy: if it seems like it's hanging, terraform is probably waiting for you to enter "yes". You can type it (despite not seeing it) and press enter and it works every time... 50% of the time. :) I added a command to the Makefile to get around this:
make apply-approved
You can also run each command separately:
# Terraform init
make init
# Terraform validate
make validate
# Create (one of the below)
make apply
make apply-approved
# Destroy
make destroy
When it's done, save the public and private key to local files:
terraform output -json public_key | jq -r . > id_azure.pub
terraform output -json private_key | jq -r . > id_azure
chmod 600 id_azure*
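ssh refuses to use a private key whose permissions are too open, so it can help to confirm the mode before connecting. Here is a small sketch (the name `check_key_mode` is hypothetical) that reads the permission string from `ls` rather than relying on a particular `stat` flavor:

```shell
# Hypothetical helper: succeed only if the file exists and is readable and
# writable by the owner alone (mode 600, shown by ls as -rw-------).
check_key_mode() {
    [ -f "$1" ] || { echo "missing: $1" >&2; return 1; }
    # The first 10 characters of ls -ld are the type and permission bits
    perms=$(ls -ld "$1" | cut -c1-10)
    if [ "$perms" = "-rw-------" ]; then
        echo "ok: $1"
    else
        echo "permissions on $1 are $perms, expected -rw-------" >&2
        return 1
    fi
}
```

For example, `check_key_mode id_azure` should print `ok: id_azure` after the chmod above.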
Then get the instance IP addresses from the command line (or portal), and ssh in!
ip_address=$(az vmss list-instance-public-ips -g terraform-testing -n flux | jq -r '.[0].ipAddress')
ssh -i ./id_azure azureuser@${ip_address}
To get a different instance, just use the index (e.g., index 1 is the second instance):
follower_address=$(az vmss list-instance-public-ips -g terraform-testing -n flux | jq -r '.[1].ipAddress')
ssh -i ./id_azure azureuser@${follower_address}
Note that if the lead broker doesn't come up as flux_0 (flux with all zeros; Azure is not predictable like that), we will need to update the brokers.
lead_broker=$(az vmss list-instances -g terraform-testing -n flux | jq -r '.[0].osProfile.computerName')
echo "The lead broker is ${lead_broker}"
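Whether the update is needed comes down to a string comparison on that hostname. As a sketch, assuming the expected lead broker name is flux000000 (the all-zeros name this guide uses) and using a hypothetical function name:

```shell
# Hypothetical helper: return 0 (true) when the brokers need updating,
# that is, when the lead broker hostname is not the expected all-zeros name.
# The expected name flux000000 is an assumption taken from this guide.
needs_broker_update() {
    expected="flux000000"
    if [ "$1" = "$expected" ]; then
        echo "lead broker is ${expected}, no update needed"
        return 1
    fi
    echo "lead broker is '$1', expected ${expected}: update needed"
    return 0
}
```

You could then run `needs_broker_update "${lead_broker}"` and only do the broker update step when it succeeds.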
To run in parallel, let's write a list of hosts, and then issue the command with pssh:
for address in $(az vmss list-instance-public-ips -g terraform-testing -n flux | jq -r '.[].ipAddress')
do
echo "azureuser@$address" >> hosts.txt
done
git clone https://github.com/lilydjwg/pssh /tmp/pssh
export PATH=/tmp/pssh/bin:$PATH
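One thing to watch: the loop above appends to hosts.txt, so running it twice leaves duplicate entries. A minimal alternative sketch (the function name `write_hosts` is hypothetical) that overwrites the file and drops blank lines and duplicates:

```shell
# Hypothetical helper: read one IP address per line on stdin and write
# "azureuser@<ip>" entries to the file given as $1. The file is overwritten,
# and blank lines and repeated addresses are skipped.
write_hosts() {
    awk 'NF && !seen[$1]++ { print "azureuser@" $1 }' > "$1"
}

# Usage with the same query as above (requires az and jq):
# az vmss list-instance-public-ips -g terraform-testing -n flux \
#   | jq -r '.[].ipAddress' | write_hosts hosts.txt
```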
Here is how you can fix all your brokers (this is only necessary if the lead broker hostname is not flux000000):
for address in $(az vmss list-instance-public-ips -g terraform-testing -n flux | jq -r '.[].ipAddress')
do
echo "Updating $address"
scp -i ./id_azure update_brokers.sh azureuser@${address}:/tmp/update_brokers.sh
# This is what the command would look like in serial
# ssh -i ./id_azure azureuser@$address "/bin/bash /tmp/update_brokers.sh flux $lead_broker"
done
# This is done in parallel
pssh -h hosts.txt -x "-i ./id_azure" "/bin/bash /tmp/update_brokers.sh flux $lead_broker"
Note that I've also provided scripts to install the OSU benchmarks and LAMMPS with the same strategy as above:
# Choose the script you want to install (keep only one assignment, since the last one wins)
script=install_osu.sh
# script=install_lammps.sh
And then install!
for address in $(az vmss list-instance-public-ips -g terraform-testing -n flux | jq -r '.[].ipAddress')
do
echo "Updating $address"
scp -i ./id_azure ./install/${script} azureuser@${address}:/tmp/${script}
done
pssh -h hosts.txt -x "-i ./id_azure" "/bin/bash /tmp/${script}"
The OSU benchmarks install to /usr/local/libexec/osu-micro-benchmarks/mpi, and LAMMPS installs to /usr/bin/lmp.
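If you switch between the two installers often, the script choice could be wrapped so that a typo fails early instead of copying a file that doesn't exist. A sketch, with a hypothetical function name and using only the two script names mentioned above:

```shell
# Hypothetical helper: map a short application name to its install script.
# Only the two scripts named in this guide are known here.
select_install_script() {
    case "$1" in
        osu)    echo "install_osu.sh" ;;
        lammps) echo "install_lammps.sh" ;;
        *)
            echo "unknown application: '$1' (expected osu or lammps)" >&2
            return 1
            ;;
    esac
}
```

For example, `script=$(select_install_script lammps)` sets the variable used by the scp and pssh commands.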
Check the cluster status, the overlay status, and try running a job:
flux resource list
flux run -N 2 hostname
Try running a benchmark!
You can export these environment variables once and they will be passed into Flux and Singularity containers. We don't build them into the images so that you are aware of them and can make an informed choice (change them, etc.).
export OMPI_MCA_btl_openib_warn_no_device_params_found=0
export OMPI_MCA_btl_vader_single_copy_mechanism=none
export OMPI_MCA_btl_openib_allow_ib=1
export UCX_TLS=ib,shm
# You can also do UCX_TLS=all
export UCX_NET_DEVICES=mlx5_0:1
flux run -N2 -n 2 -o cpu-affinity=per-task /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency
flux run -N2 -n 192 /usr/local/libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce
# OSU MPI Latency Test v5.8
# Size Latency (us)
0 1.61
1 1.60
2 1.60
4 1.61
8 1.61
16 1.61
32 1.75
64 1.80
128 1.84
256 2.35
512 2.44
1024 2.59
2048 2.77
4096 3.52
8192 4.04
16384 5.34
32768 6.77
65536 9.24
131072 13.89
262144 17.26
524288 27.93
1048576 49.90
2097152 91.88
4194304 177.12
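To compare runs (for example, bare metal versus container), it can be handy to boil the latency table down to a couple of numbers. A minimal sketch, assuming the two-column "size latency" layout shown above and a hypothetical function name:

```shell
# Hypothetical helper: summarize osu_latency output read from stdin.
# Assumes data rows are "<size> <latency>" and header lines start with '#'.
osu_summary() {
    awk '
        /^#/ || NF < 2 { next }                    # skip headers and blank lines
        min == "" || $2 + 0 < min + 0 { min = $2; min_size = $1 }
        { last_size = $1; last = $2 }              # remember the final row
        END {
            if (min == "") exit 1                  # no data rows seen
            printf "min latency: %s us at %s bytes\n", min, min_size
            printf "largest message: %s bytes at %s us\n", last_size, last
        }
    '
}
```

You would pipe the benchmark into it, e.g. `flux run -N2 -n 2 ... osu_latency | osu_summary`. When several sizes tie for the minimum, the first one is reported.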
You can decrease the problem size for a faster run (x,y,z parameters).
cd /tmp/lammps/examples/reaxff/HNS
# 8x16x16 is about 37 seconds, 16^3 is ~1:07
flux run -N2 -n 192 -o cpu-affinity=per-task lmp -v x 8 -v y 16 -v z 16 -in in.reaxff.hns -nocite
LAMMPS output
LAMMPS (17 Apr 2024 - Development - a8687b5)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
using 1 OpenMP thread(s) per MPI task
Reading data file ...
triclinic box = (0 0 0) to (22.326 11.1412 13.778966) with tilt (0 -5.02603 0)
8 by 4 by 6 MPI processor grid
reading atoms ...
304 atoms
reading velocities ...
304 velocities
read_data CPU = 0.012 seconds
Replication is creating a 8x16x16 = 2048 times larger system...
triclinic box = (0 0 0) to (178.608 178.2592 220.46346) with tilt (0 -80.41648 0)
4 by 6 by 8 MPI processor grid
bounding box image = (0 -1 -1) to (0 1 1)
bounding box extra memory = 0.03 MB
average # of replicas added to proc = 48.64 out of 2048 (2.38%)
622592 atoms
replicate CPU = 0.005 seconds
Neighbor list info ...
update: every = 20 steps, delay = 0 steps, check = no
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 11
ghost atom cutoff = 11
binsize = 5.5, bins = 48 33 41
2 neighbor lists, perpetual/occasional/extra = 2 0 0
(1) pair reaxff, perpetual
attributes: half, newton off, ghost
pair build: half/bin/ghost/newtoff
stencil: full/ghost/bin/3d
bin: standard
(2) fix qeq/reax, perpetual, copy from (1)
attributes: half, newton off
pair build: copy
stencil: none
bin: none
Setting up Verlet run ...
Unit style : real
Current step : 0
Time step : 0.1
Per MPI rank memory allocation (min/avg/max) = 252.5 | 252.7 | 253 Mbytes
Step Temp PotEng Press E_vdwl E_coul Volume
0 300 -113.27833 439.01464 -111.57687 -1.7014647 7019230
10 300.82459 -113.28061 818.23773 -111.57918 -1.7014335 7019230
20 302.60711 -113.2858 1779.7064 -111.58448 -1.7013214 7019230
30 302.90619 -113.28656 4424.6361 -111.58547 -1.701093 7019230
40 301.12001 -113.28117 6444.965 -111.5804 -1.7007665 7019230
50 297.98897 -113.27178 6568.4529 -111.57138 -1.7004009 7019230
60 295.18676 -113.26338 6325.9237 -111.56334 -1.7000345 7019230
70 294.84699 -113.26231 6840.651 -111.56264 -1.6996686 7019230
80 297.64748 -113.27065 8213.699 -111.57135 -1.6993062 7019230
90 301.45139 -113.28199 9328.5706 -111.58301 -1.6989859 7019230
100 302.49959 -113.28506 10225.066 -111.5863 -1.6987587 7019230
Loop time of 36.4598 on 192 procs for 100 steps with 622592 atoms
Performance: 0.024 ns/day, 1012.773 hours/ns, 2.743 timesteps/s, 1.708 Matom-step/s
100.0% CPU use with 192 MPI tasks x 1 OpenMP threads
MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total
---------------------------------------------------------------
Pair | 20.824 | 23.166 | 25.392 | 16.3 | 63.54
Neigh | 0.36547 | 0.37331 | 0.37996 | 0.5 | 1.02
Comm | 0.11136 | 1.8392 | 4.6777 | 67.7 | 5.04
Output | 0.0010649 | 0.098938 | 0.20956 | 17.5 | 0.27
Modify | 10.573 | 10.98 | 11.841 | 13.0 | 30.11
Other | | 0.00211 | | | 0.01
Nlocal: 3242.67 ave 3264 max 3216 min
Histogram: 10 26 23 5 0 5 38 50 30 5
Nghost: 12107.3 ave 12136 max 12071 min
Histogram: 6 12 23 11 16 23 34 29 33 5
Neighs: 1.07023e+06 ave 1.07661e+06 max 1.06257e+06 min
Histogram: 13 27 21 2 2 11 37 50 24 5
Total # of neighbors = 2.0548396e+08
Ave neighs/atom = 330.04593
Neighbor list builds = 5
Dangerous builds not checked
Total wall time: 0:00:37
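The numbers in this output are easy to cross-check. The replicated system has 304 atoms times x*y*z replicas (304 * 8 * 16 * 16 = 622592), and the performance line follows from the loop time: 100 steps / 36.4598 s is about 2.743 timesteps/s, and 622592 atoms times that rate is about 1.708 Matom-step/s. A small sketch (function names are hypothetical) that does the same arithmetic:

```shell
# Hypothetical helper: atoms after replication, assuming the 304-atom
# base cell reported in the output above.
replicated_atoms() {
    echo $((304 * $1 * $2 * $3))
}

# Hypothetical helper: recompute rates from the "Loop time of ..." line
# of a LAMMPS log read from stdin.
lammps_rates() {
    awk '
        /^Loop time of/ {
            # Loop time of <sec> on <procs> procs for <steps> steps with <atoms> atoms
            sec = $4; steps = $9; atoms = $12
            printf "timesteps/s: %.3f\n", steps / sec
            printf "Matom-step/s: %.3f\n", atoms * steps / sec / 1e6
        }
    '
}
```

For example, `replicated_atoms 8 16 16` prints 622592, and piping the log above into `lammps_rates` reproduces the 2.743 and 1.708 figures.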
You can pull Singularity containers to run the same tests, but in containers.
flux exec --rank 0-1 singularity pull docker://ghcr.io/converged-computing/flux-tutorials:azure-2404-lammps-reax
flux exec --rank 0-1 singularity pull docker://ghcr.io/converged-computing/flux-tutorials:azure-2404-osu
# OSU Benchmarks
flux run -N2 -n 192 -o cpu-affinity=per-task singularity exec --bind /opt/run/flux ./flux-tutorials_azure-2404-osu.sif /opt/osu-benchmark/build.openmpi/mpi/collective/osu_allreduce
flux run -N2 -n 2 -o cpu-affinity=per-task singularity exec --bind /opt/run/flux ./flux-tutorials_azure-2404-osu.sif /opt/osu-benchmark/build.openmpi/mpi/pt2pt/osu_latency
# LAMMPS (1:06 to 1:07)
flux run -o cpu-affinity=per-task -N2 -n 192 singularity exec --bind /opt/run/flux ./flux-tutorials_azure-2404-lammps-reax.sif /usr/bin/lmp -v x 16 -v y 16 -v z 16 -in in.reaxff.hns -nocite
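The .sif filenames in the commands above come from Singularity's default naming for pulled images, which is the image name and tag joined with an underscore. As a sketch (the function name `sif_name` is hypothetical), this derives the filename from the URI so the pull and run commands stay in sync:

```shell
# Hypothetical helper: derive the default .sif filename that
# "singularity pull" writes for a docker:// URI (name_tag.sif).
sif_name() {
    ref="${1#docker://}"     # drop the transport prefix
    name="${ref##*/}"        # last path component, e.g. flux-tutorials:azure-2404-osu
    case "$name" in
        *:*) echo "${name%%:*}_${name#*:}.sif" ;;
        *)   echo "${name}_latest.sif" ;;        # untagged images default to latest
    esac
}
```

For example, `sif_name docker://ghcr.io/converged-computing/flux-tutorials:azure-2404-osu` prints `flux-tutorials_azure-2404-osu.sif`.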
See flux-usernetes for instructions to build and deploy user-space Kubernetes.
This should work (but see debugging).
make destroy
But if not, you can either delete the resource group from the console, or the command line:
az group delete --name terraform-testing
Note that this current build does not have flux-pmix, which might lead to issues with MPI. It's an issue of the VM base being compiled with a libpmix.so that has a different ABI than what flux is expecting. I will be looking into it.
Here is various output about the environment, collected on January 9, 2024.
ucx_info -d
UCX info output
#
# Memory domain: self
# Component: self
# register: unlimited, cost: 0 nsec
# remote key: 0 bytes
# rkey_ptr is supported
# memory types: host (access,reg_nonblock,reg,cache)
#
# Transport: self
# Device: memory
# Type: loopback
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 19360.00 MB/sec
# latency: 0 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 8K
# am_bcopy: <= 8K
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 0 bytes
# iface address: 8 bytes
# error handling: ep_check
#
#
# Memory domain: tcp
# Component: tcp
# register: unlimited, cost: 0 nsec
# remote key: 0 bytes
# memory types: host (access,reg_nonblock,reg,cache)
#
# Transport: tcp
# Device: eth0
# Type: network
# System device: <unknown>
#
# capabilities:
# bandwidth: 2200.00/ppn + 0.00 MB/sec
# latency: 5212 nsec
# overhead: 50000 nsec
# put_zcopy: <= 18446744073709551590, up to 6 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 0
# am_short: <= 8K
# am_bcopy: <= 8K
# am_zcopy: <= 64K, up to 6 iov
# am_opt_zcopy_align: <= 1
# am_align_mtu: <= 0
# am header: <= 8037
# connection: to ep, to iface
# device priority: 0
# device num paths: 1
# max eps: 256
# device address: 6 bytes
# iface address: 2 bytes
# ep address: 10 bytes
# error handling: peer failure, ep_check, keepalive
#
# Transport: tcp
# Device: lo
# Type: network
# System device: <unknown>
#
# capabilities:
# bandwidth: 11.91/ppn + 0.00 MB/sec
# latency: 10960 nsec
# overhead: 50000 nsec
# put_zcopy: <= 18446744073709551590, up to 6 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 0
# am_short: <= 8K
# am_bcopy: <= 8K
# am_zcopy: <= 64K, up to 6 iov
# am_opt_zcopy_align: <= 1
# am_align_mtu: <= 0
# am header: <= 8037
# connection: to ep, to iface
# device priority: 1
# device num paths: 1
# max eps: 256
# device address: 18 bytes
# iface address: 2 bytes
# ep address: 10 bytes
# error handling: peer failure, ep_check, keepalive
#
#
# Connection manager: tcp
# max_conn_priv: 2064 bytes
#
# Memory domain: sysv
# Component: sysv
# allocate: unlimited
# remote key: 12 bytes
# rkey_ptr is supported
# memory types: host (access,alloc,cache)
#
# Transport: sysv
# Device: memory
# Type: intra-node
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 15360.00 MB/sec
# latency: 80 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 100
# am_bcopy: <= 8256
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 8 bytes
# error handling: ep_check
#
#
# Memory domain: posix
# Component: posix
# allocate: <= 235262556K
# remote key: 24 bytes
# rkey_ptr is supported
# memory types: host (access,alloc,cache)
#
# Transport: posix
# Device: memory
# Type: intra-node
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 15360.00 MB/sec
# latency: 80 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 100
# am_bcopy: <= 8256
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 8 bytes
# error handling: ep_check
#
#
# Memory domain: mlx5_0
# Component: ib
# register: unlimited, dmabuf, cost: 16000 + 0.060 * N nsec
# remote key: 8 bytes
# local memory handle is required for zcopy
# memory invalidation is supported
# memory types: host (access,reg,cache)
#
# Transport: dc_mlx5
# Device: mlx5_0:1
# Type: network
# System device: mlx5_0 (0)
#
# capabilities:
# bandwidth: 23588.47/ppn + 0.00 MB/sec
# latency: 660 nsec
# overhead: 40 nsec
# put_short: <= 172
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 11 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 4K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 11 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 4K
# am_short: <= 186
# am_bcopy: <= 8254
# am_zcopy: <= 8254, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 138
# domain: device
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 50
# device num paths: 1
# max eps: inf
# device address: 5 bytes
# iface address: 7 bytes
# error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
# Transport: rc_verbs
# Device: mlx5_0:1
# Type: network
# System device: mlx5_0 (0)
#
# capabilities:
# bandwidth: 23588.47/ppn + 0.00 MB/sec
# latency: 600 + 1.000 * N nsec
# overhead: 75 nsec
# put_short: <= 124
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 5 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 4K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 5 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 4K
# am_short: <= 123
# am_bcopy: <= 8255
# am_zcopy: <= 8255, up to 4 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 127
# domain: device
# atomic_add: 64 bit
# atomic_fadd: 64 bit
# atomic_cswap: 64 bit
# connection: to ep
# device priority: 50
# device num paths: 1
# max eps: 256
# device address: 5 bytes
# ep address: 7 bytes
# error handling: peer failure, ep_check
#
#
# Transport: rc_mlx5
# Device: mlx5_0:1
# Type: network
# System device: mlx5_0 (0)
#
# capabilities:
# bandwidth: 23588.47/ppn + 0.00 MB/sec
# latency: 600 + 1.000 * N nsec
# overhead: 40 nsec
# put_short: <= 220
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 14 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 4K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 14 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 4K
# am_short: <= 234
# am_bcopy: <= 8254
# am_zcopy: <= 8254, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 186
# domain: device
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to ep
# device priority: 50
# device num paths: 1
# max eps: 256
# device address: 5 bytes
# ep address: 10 bytes
# error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
# Transport: ud_verbs
# Device: mlx5_0:1
# Type: network
# System device: mlx5_0 (0)
#
# capabilities:
# bandwidth: 23588.47/ppn + 0.00 MB/sec
# latency: 630 nsec
# overhead: 105 nsec
# am_short: <= 116
# am_bcopy: <= 4088
# am_zcopy: <= 4088, up to 5 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 3992
# connection: to ep, to iface
# device priority: 50
# device num paths: 1
# max eps: inf
# device address: 5 bytes
# iface address: 3 bytes
# ep address: 6 bytes
# error handling: peer failure, ep_check
#
#
# Transport: ud_mlx5
# Device: mlx5_0:1
# Type: network
# System device: mlx5_0 (0)
#
# capabilities:
# bandwidth: 23588.47/ppn + 0.00 MB/sec
# latency: 630 nsec
# overhead: 80 nsec
# am_short: <= 180
# am_bcopy: <= 4088
# am_zcopy: <= 4088, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 132
# connection: to ep, to iface
# device priority: 50
# device num paths: 1
# max eps: inf
# device address: 5 bytes
# iface address: 3 bytes
# ep address: 6 bytes
# error handling: peer failure, ep_check
#
#
# Memory domain: mlx5_0
# Component: gga
# register: unlimited, dmabuf, cost: 16000 + 0.060 * N nsec
# remote key: 8 bytes
# local memory handle is required for zcopy
# memory invalidation is supported
# memory types: host (access,reg,cache)
# < no supported devices found >
#
# Connection manager: rdmacm
# max_conn_priv: 54 bytes
#
# Memory domain: cma
# Component: cma
# register: unlimited, cost: 9 nsec
# memory types: host (access,reg_nonblock,reg,cache)
#
# Transport: cma
# Device: memory
# Type: intra-node
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 11145.00 MB/sec
# latency: 80 nsec
# overhead: 2000 nsec
# put_zcopy: unlimited, up to 16 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 1
# get_zcopy: unlimited, up to 16 iov
# get_opt_zcopy_align: <= 1
# get_align_mtu: <= 1
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 4 bytes
# error handling: peer failure, ep_check
#
#
# Memory domain: knem
# Component: knem
# register: unlimited, cost: 1200 + 0.007 * N nsec
# remote key: 16 bytes
# memory types: host (access,reg,cache)
#
# Transport: knem
# Device: memory
# Type: intra-node
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 13862.00 MB/sec
# latency: 80 nsec
# overhead: 2000 nsec
# put_zcopy: unlimited, up to 16 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 1
# get_zcopy: unlimited, up to 16 iov
# get_opt_zcopy_align: <= 1
# get_align_mtu: <= 1
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 0 bytes
# error handling: none
#
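That is a lot of output when all you want is the list of transports and devices to choose from for UCX_TLS and UCX_NET_DEVICES. A minimal sketch (hypothetical function name) that reduces it to unique transport and device pairs, assuming the "Transport:" and "Device:" line layout shown above:

```shell
# Hypothetical helper: list unique "<transport> <device>" pairs from
# "ucx_info -d" output read from stdin. Each Transport line is paired
# with the Device line that follows it.
ucx_transports() {
    awk '
        $2 == "Transport:" { transport = $3 }
        $2 == "Device:" && transport != "" { print transport, $3; transport = "" }
    ' | sort -u
}
```

Running `ucx_info -d | ucx_transports` on the output above would include lines such as `rc_mlx5 mlx5_0:1` and `tcp eth0`.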
This looks to be memory copy bandwidth:
$ ucx_info -M
# Using built-in memcpy() for size inf..inf
# Memcpy bandwidth:
# 4096 bytes: 76386.180 MB/s
# 8192 bytes: 89185.127 MB/s
# 16384 bytes: 93675.259 MB/s
# 32768 bytes: 53620.904 MB/s
# 65536 bytes: 51693.470 MB/s
# 131072 bytes: 51912.292 MB/s
# 262144 bytes: 48203.195 MB/s
# 524288 bytes: 43202.249 MB/s
# 1048576 bytes: 36308.450 MB/s
# 2097152 bytes: 36124.949 MB/s
# 4194304 bytes: 36190.920 MB/s
# 8388608 bytes: 36184.046 MB/s
# 16777216 bytes: 36257.343 MB/s
# 33554432 bytes: 36221.641 MB/s
# 67108864 bytes: 36208.731 MB/s
# 134217728 bytes: 29613.116 MB/s
# 268435456 bytes: 27458.495 MB/s
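To pick out the peak from a table like this, here is a small sketch (hypothetical function name), assuming the "# <size> bytes: <rate> MB/s" line layout shown above:

```shell
# Hypothetical helper: report the highest memcpy bandwidth from
# "ucx_info -M" output read from stdin.
memcpy_peak() {
    awk '
        $3 == "bytes:" && $5 == "MB/s" {
            if ($4 + 0 > best + 0) { best = $4; size = $2 }
        }
        END { if (size != "") printf "peak: %s MB/s at %s bytes\n", best, size }
    '
}
```

For the table above, `ucx_info -M | memcpy_peak` would report the 16384-byte row (93675.259 MB/s).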
Device info:
$ ibv_devinfo
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 20.31.1014
node_guid: 0015:5dff:fe33:ff23
sys_image_guid: 946d:ae03:0068:a6ba
vendor_id: 0x02c9
vendor_part_id: 4124
hw_ver: 0x0
board_id: MT_0000000223
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 677
port_lmc: 0x00
link_layer: InfiniBand
$ ibv_devices
device node GUID
------ ----------------
mlx5_0 00155dfffe33ff23
More devices...
$ flux exec -r 0-1 lspci
0101:00:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
254f:00:00.0 Non-Volatile memory controller: Microsoft Corporation Device b111
cf71:00:00.0 Non-Volatile memory controller: Microsoft Corporation Device b111
0101:00:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
5132:00:00.0 Non-Volatile memory controller: Microsoft Corporation Device b111
759e:00:00.0 Non-Volatile memory controller: Microsoft Corporation Device b111
And networking.
$ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 7c:1e:52:11:5d:01 brd ff:ff:ff:ff:ff:ff
3: ibP257s63109: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/infiniband 00:00:01:48:fe:80:00:00:00:00:00:00:00:15:5d:ff:fd:33:ff:23 brd 00:ff:ff:ff:ff:12:40:1b:80:2a:00:00:00:00:00:00:ff:ff:ff:ff
altname ibP257p0s0
Some software:
$ which mpirun
/usr/local/bin/mpirun
$ mpirun --version
mpirun (Open MPI) 4.1.2
$ gcc --version
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
For advanced users, we have a docker directory with builds that use the exact same logic, built into containers! You can use the base containers for your own applications.
Depending on your environment, terraform (e.g., make or make destroy) doesn't always work. I get this error from the Azure Cloud Shell:
terraform destroy
random_pet.id: Refreshing state... [id=usable-grouper]
random_string.fqdn: Refreshing state... [id=lhppiw]
╷
│ Error: building account: could not acquire access token to parse claims: running Azure CLI: exit status 1: ERROR: Failed to connect to MSI. Please make sure MSI is configured correctly.
│ Get Token request returned: <Response [400]>
│
│ with provider["registry.terraform.io/hashicorp/azurerm"],
│ on main.tf line 28, in provider "azurerm":
│ 28: provider "azurerm" {
│
╵
make: *** [Makefile:22: destroy] Error 1
If I open a new cloud shell, it seems to magically go away. But you can also interact with the az tool (that does seem to work) or issue commands via clicking directly in the portal.