
Generating MFC Images and Testing Them on OSPool #935


Draft · wants to merge 96 commits into base: master

Conversation

@Malmahrouqi3 (Collaborator) commented Jul 11, 2025

User description

Description

Closes #654

This PR generates four images: CPU, CPU_Benchmark, GPU, and GPU_Benchmark. All MFC builds occur on a GitHub runner, while testing and storage of the latest images take place on OSPool. The images are retrievable from the CI itself, since they are pre-built MFC images with pre-installed packages that can be accessed with simple commands.

Debugging info:

  • To generate an image locally: apptainer build mfc_cpu.sif Singularity.cpu
  • To start a shell instance: apptainer shell --fakeroot --writable-tmpfs mfc_cpu.sif
  • To execute specific commands directly: apptainer exec --fakeroot --writable-tmpfs mfc_cpu.sif /bin/bash -c './mfc.sh test -a'
  • To download container images, install pelican, then run pelican object get osdf:///ospool/ap40/data/<user>/<image>.sif <local dir>
    e.g. pelican object get osdf:///ospool/ap40/data/mohammed.al-mahrouqi/mfc_gpu.sif ~/Desktop
    This requires login credentials from any cilogon.org college/institute.
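
Putting the retrieval and testing steps together, a minimal end-to-end session (a sketch that only chains the commands above; the mfc_cpu.sif object path under the same OSDF prefix is an assumption):

  # Fetch the latest CPU image from OSDF (requires a cilogon.org login),
  # then run the full MFC test suite inside the pre-built container.
  pelican object get osdf:///ospool/ap40/data/mohammed.al-mahrouqi/mfc_cpu.sif .
  apptainer exec --fakeroot --writable-tmpfs mfc_cpu.sif /bin/bash -c './mfc.sh test -a'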

To-dos:

  • Proper packages and base container for each recipe.
  • An HTCondor test script to request specific allocations per image (see the sketch after this list).
  • Sanity-check by using the images on various resources/clusters.
  • Maintainer-triggered builds if needed; otherwise, only the most recent images will be hosted.
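
For the HTCondor to-do, a minimal sketch of what a per-image submit file could look like (the executable name, resource values, and queue count are placeholders, not the configured ones):

  # Hypothetical HTCondor submit file for testing one image; all values are illustrative.
  executable     = run_tests.sh
  arguments      = mfc_cpu.sif
  request_cpus   = 4
  request_memory = 8 GB
  request_disk   = 10 GB
  log            = mfc_cpu_$(Cluster)_$(Process).log
  output         = mfc_cpu_$(Cluster)_$(Process).out
  error          = mfc_cpu_$(Cluster)_$(Process).err
  queue 1

Submitted with condor_submit, this yields log names of the form name_cluster_process.log, matching the pattern parsed in the comments below.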

Note to Self: the current secrets are hosted in the fork; prior to merging, new dedicated ones should be added to the base repo. To do so, request an access point under the "GATech_Bryngelson" project, then upload a public SSH key to https://registry.cilogon.org/. Later on, update the secrets, which include the private SSH key and user@host.
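
A sketch of how the CI could use those two secrets to push a freshly built image to the access point over scp (the variable names OSPOOL_SSH_KEY and OSPOOL_USER_HOST are assumed, not the actual secret names):

  # Materialize the private-key secret with strict permissions, then copy an image to the access point.
  printf '%s\n' "$OSPOOL_SSH_KEY" > id_ospool
  chmod 600 id_ospool
  scp -i id_ospool mfc_cpu.sif "$OSPOOL_USER_HOST:~/"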

Refs:
NVIDIA Container


PR Type

Other


Description

  • Remove existing CI workflows and testing infrastructure

  • Add Singularity container image building workflow

  • Create four container definitions for CPU/GPU variants

  • Implement automated image building and testing on OSPool


Changes diagram

flowchart LR
  A["Old CI Workflows"] -- "removed" --> B["Deleted Files"]
  C["New Container Workflow"] -- "builds" --> D["Singularity Images"]
  D -- "stores on" --> E["OSPool"]
  F["Container Definitions"] -- "defines" --> G["CPU/GPU Variants"]

Changes walkthrough 📝

Relevant files
Miscellaneous
17 files
build.sh
Remove Frontier build script                                                         
+0/-9     
submit.sh
Remove Frontier job submission script                                       
+0/-56   
test.sh
Remove Frontier test script                                                           
+0/-10   
bench.sh
Remove Phoenix benchmark script                                                   
+0/-20   
submit-bench.sh
Remove Phoenix benchmark submission script                             
+0/-64   
submit.sh
Remove Phoenix job submission script                                         
+0/-64   
test.sh
Remove Phoenix test script                                                             
+0/-21   
bench.yml
Remove benchmark workflow                                                               
+0/-68   
cleanliness.yml
Remove code cleanliness workflow                                                 
+0/-127 
coverage.yml
Remove coverage check workflow                                                     
+0/-48   
docs.yml
Remove documentation workflow                                                       
+0/-76   
formatting.yml
Remove formatting check workflow                                                 
+0/-19   
line-count.yml
Remove line count workflow                                                             
+0/-54   
lint-source.yml
Remove source linting workflow                                                     
+0/-51   
lint-toolchain.yml
Remove toolchain linting workflow                                               
+0/-17   
spelling.yml
Remove spell check workflow                                                           
+0/-17   
test.yml
Remove main test suite workflow                                                   
+0/-131 
Enhancement
5 files
container-image.yml
Add Singularity image building workflow                                   
+63/-0   
Singularity.cpu
Add CPU container definition                                                         
+24/-0   
Singularity.cpu_bench
Add CPU benchmark container definition                                     
+27/-0   
Singularity.gpu
Add GPU container definition                                                         
+34/-0   
Singularity.gpu_bench
Add GPU benchmark container definition                                     
+32/-0   

    @Malmahrouqi3 (Collaborator, Author) commented Jul 21, 2025

    Status Update: I overslept on this. The concept itself works as intended. The persistent hurdle has been running out of disk space for the GPU images, as the NVIDIA HPC base container is roughly 5-8 GB. Clearing the cache on either side (GH runner or OSPool) does not help. I tried different base containers and recipe instructions, but without success.

    GPU Base Container: nvcr.io/nvidia/nvhpc:23.11-devel-cuda12.3-ubuntu22.04 is pinned to an older release because one of the CUDA tools was deprecated after CUDA 12.3, while ubuntu22.04 is the most recent OS release with a recent Python version. nvhpc 23.11 is capable of compiling MFC with no compromises.

    New Approach

    Build: move the build process to self-hosted Phoenix.
    Test: transfer images with scp. Condor batch job files are already configured. CPU images would run on different CPU models and GPU images on different GPUs.
    Migrate: transfer images to larger storage on OSDF, allowing the images to be publicly accessible.
    Enhance: add a flag for local builds, e.g. ./mfc.sh build --<Build Instructions> --image <Singularity File> <Image Name>. Recipes then become simpler, as they just copy the entire MFC directory with its custom build instructions and enclose it into a container image (see the sketch below).
    Employ: use the NVIDIA HPC-Benchmarks base container to acquire benchmarking results; I will experiment with it for a while.
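
    A rough sketch of such a simplified recipe (the base image and section contents are assumptions about the proposed design, not the current Singularity.* files):

    # Hypothetical simplified recipe: copy the whole MFC directory and build it inside the image.
    Bootstrap: docker
    From: ubuntu:22.04

    %files
        . /opt/MFC

    %post
        # (install compiler/MPI/Python dependencies here)
        cd /opt/MFC && ./mfc.sh build

    It would then be built with apptainer build <Image Name> <Singularity File>, matching the proposed flag's arguments.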

    @sbryngelson (Member) commented Jul 21, 2025

    These folks build all of nvhpc + openmpi + cuda into a Docker image using a standard GH runner, so far as I can tell: https://github.com/link89/github-action-demo/blob/cp2k-with-deepmd/cp2k/2025.1-cuda124-openmpi-avx512-psmp/build.sh. Can we just try this simpler approach first? It seems some of the issues here come from attempting to get it done in one shot. Perhaps try something easy first, then add complexity.

    For example, building a simple gnu+mpi docker container that has MFC in it. We could even use this as an example for new users so they can get up and running without worrying about dependencies on their system.

    @Malmahrouqi3 (Collaborator, Author) replied:

    Ohh, interesting. I will try this approach and see if it can compile and run MFC on a GH runner. Converting between Docker and Singularity is not something to worry about anyway.
    A gnu+mpi Docker container should be easy to make, honestly. I will add a container image recipe for that.
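
    A minimal sketch of what that recipe might look like (the base image and package list are guesses; MFC's actual dependency set may differ):

    # Hypothetical gnu+mpi Dockerfile with MFC pre-built; packages are illustrative, not a tested list.
    FROM ubuntu:22.04
    RUN apt-get update && apt-get install -y --no-install-recommends \
            build-essential gfortran cmake git python3 python3-pip python3-venv \
            openmpi-bin libopenmpi-dev && \
        rm -rf /var/lib/apt/lists/*
    COPY . /opt/MFC
    WORKDIR /opt/MFC
    RUN ./mfc.sh build
    CMD ["/bin/bash"]

    Converting the result to a .sif afterwards with apptainer keeps the OSPool side unchanged.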

    @Malmahrouqi3 (Collaborator, Author) commented Jul 22, 2025

    Sorta figured it out with --tmpdir /tmp/mfc_tmp.
    I updated the PR description with instructions on how to retrieve the container images. Afterwards, you can use them to run examples/tests instantly.

    Edit (Note to Self):
    Stats to display

    # For every HTCondor log, print the last slot-description block and the last
    # 'Partitionable Resources' usage table (awk keeps only the final match of each range).
    for f in *.log; do
      echo "File: $f"
      awk '/SlotName:/, /Memory =/ {block = block $0 ORS} /Memory =/ {last = block; block=""} END {print last}' "$f"
      awk '/Partitionable Resources/,/TimeSlotBusy \(s\)/ {block = block $0 ORS} /TimeSlotBusy \(s\)/ {last = block; block=""} END {if(last) print last}' "$f"
    done
    

    e.g.

    File: mfc_cpu_bench_12644463_0.log
            SlotName: slot1_1@[email protected]
            CondorScratchDir = "/scratch/local/osguserg/305559/glide_UuCd5P/execute/dir_2134154"
            Cpus = 12
            Disk = 8394329
            GLIDEIN_ResourceName = "Utah-Granite-CE1"
            GPUs = 0
            Memory = 8192
    
            Partitionable Resources :    Usage  Request Allocated 
               Cpus                 :                12        12 
               Disk (KB)            :        1  8388608   8389632 
               GPUs                 :                           0 
               Memory (MB)          :              8192      8192 
               TimeSlotBusy (s)     :       53                    
    

    The log files are used to trigger a failure if any *.out/*.err file contains the keyword 'Error'.
    After each run, clear all corresponding .log/.err/.out files.
    If all jobs passed, migrate the images to OSDF to update the MFC container images.
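
    A small sketch of that check and cleanup (it only implements the keyword match and file removal described above; it is not the configured CI step):

    # Fail the run if any stdout/stderr file mentions 'Error', then clean up job artifacts.
    if grep -l "Error" *.out *.err 2>/dev/null; then
      echo "At least one job reported an error" >&2
      exit 1
    fi
    rm -f *.log *.out *.err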

    Corresponding log numbers can be referenced with

    # Extract the four distinct cluster IDs (in version-sorted order) from the log file names.
    read CPU CPU_BENCH GPU GPU_BENCH < <(ls -1v *.log | sed -E 's/.*_([0-9]+)_[0-9]+\.log/\1/' | awk '!seen[$0]++' | head -n4 | xargs)
    echo "Logs CPU: $CPU, CPU_BENCH: $CPU_BENCH, GPU: $GPU, GPU_BENCH: $GPU_BENCH"
    

    Wait for each job instance using the following

    for cluster in $CPU $CPU_BENCH $GPU $GPU_BENCH; do
      for log in *${cluster}_*.log; do
        echo "Waiting for $log"
        condor_wait "$log"
      done
    done
    

    Successfully merging this pull request may close these issues.

    [Container] Instructions on how to use on a node that doesn't have access to the internet