
Generating MFC Images and Testing Them on OSPool #935


Draft · wants to merge 96 commits into base: master

Conversation

@Malmahrouqi3 (Collaborator) commented Jul 11, 2025

User description

Description

Closes #654

This PR generates four images: CPU, CPU_Benchmark, GPU, and GPU_Benchmark. All MFC builds occur on a GitHub runner, while testing and storage of the latest images take place on OSPool. The images are retrievable from the CI itself, since they are pre-built MFC images with pre-installed packages that can be accessed with simple commands.

Debugging info:

  • To generate an image locally: apptainer build mfc_cpu.sif Singularity.cpu
  • To start a shell instance: apptainer shell --fakeroot --writable-tmpfs mfc_cpu.sif
  • To execute specific commands directly: apptainer exec --fakeroot --writable-tmpfs mfc_cpu.sif /bin/bash -c './mfc.sh test -a'
  • To download container images, install pelican, then run pelican object get osdf:///ospool/ap40/data/<user>/<image>.sif <local dir>
    e.g. pelican object get osdf:///ospool/ap40/data/mohammed.al-mahrouqi/mfc_gpu.sif ~/Desktop
    This requires login credentials from any cilogon.org college/institute.
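
Putting the retrieval and testing steps together, a minimal end-to-end session (a sketch that only chains the commands above; the mfc_cpu.sif object path under the same OSDF prefix is an assumption):

  # Fetch the latest CPU image from OSDF (requires a cilogon.org login),
  # then run the full MFC test suite inside the pre-built container.
  pelican object get osdf:///ospool/ap40/data/mohammed.al-mahrouqi/mfc_cpu.sif .
  apptainer exec --fakeroot --writable-tmpfs mfc_cpu.sif /bin/bash -c './mfc.sh test -a'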

To-dos:

  • Proper packages and base container for each recipe.
  • An HTCondor test script to request specific allocations per image (see the sketch after this list).
  • Sanity-check by using the images on various resources/clusters.
  • Maintainer-triggered builds if needed; otherwise, only the most recent images will be hosted.
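
For the HTCondor to-do, a minimal sketch of what a per-image submit file could look like (the executable name, resource values, and queue count are placeholders, not the configured ones):

  # Hypothetical HTCondor submit file for testing one image; all values are illustrative.
  executable     = run_tests.sh
  arguments      = mfc_cpu.sif
  request_cpus   = 4
  request_memory = 8 GB
  request_disk   = 10 GB
  log            = mfc_cpu_$(Cluster)_$(Process).log
  output         = mfc_cpu_$(Cluster)_$(Process).out
  error          = mfc_cpu_$(Cluster)_$(Process).err
  queue 1

Submitted with condor_submit, this yields log names of the form name_cluster_process.log, matching the pattern parsed in the comments below.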

Note to Self: the current secrets are hosted in the fork; prior to merging, new dedicated ones should be added to the base repo. To do so, request an access point under the "GATech_Bryngelson" project, then upload a public SSH key to https://registry.cilogon.org/. Later on, update the secrets, which include the private SSH key and user@host.
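
A sketch of how the CI could use those two secrets to push a freshly built image to the access point over scp (the variable names OSPOOL_SSH_KEY and OSPOOL_USER_HOST are assumed, not the actual secret names):

  # Materialize the private-key secret with strict permissions, then copy an image to the access point.
  printf '%s\n' "$OSPOOL_SSH_KEY" > id_ospool
  chmod 600 id_ospool
  scp -i id_ospool mfc_cpu.sif "$OSPOOL_USER_HOST:~/"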

Refs:
NVIDIA Container


PR Type

Other


Description

  • Remove existing CI workflows and testing infrastructure

  • Add Singularity container image building workflow

  • Create four container definitions for CPU/GPU variants

  • Implement automated image building and testing on OSPool


Changes diagram

flowchart LR
  A["Old CI Workflows"] -- "removed" --> B["Deleted Files"]
  C["New Container Workflow"] -- "builds" --> D["Singularity Images"]
  D -- "stores on" --> E["OSPool"]
  F["Container Definitions"] -- "defines" --> G["CPU/GPU Variants"]

Changes walkthrough 📝

Relevant files
Miscellaneous
17 files
build.sh
Remove Frontier build script                                                         
+0/-9     
submit.sh
Remove Frontier job submission script                                       
+0/-56   
test.sh
Remove Frontier test script                                                           
+0/-10   
bench.sh
Remove Phoenix benchmark script                                                   
+0/-20   
submit-bench.sh
Remove Phoenix benchmark submission script                             
+0/-64   
submit.sh
Remove Phoenix job submission script                                         
+0/-64   
test.sh
Remove Phoenix test script                                                             
+0/-21   
bench.yml
Remove benchmark workflow                                                               
+0/-68   
cleanliness.yml
Remove code cleanliness workflow                                                 
+0/-127 
coverage.yml
Remove coverage check workflow                                                     
+0/-48   
docs.yml
Remove documentation workflow                                                       
+0/-76   
formatting.yml
Remove formatting check workflow                                                 
+0/-19   
line-count.yml
Remove line count workflow                                                             
+0/-54   
lint-source.yml
Remove source linting workflow                                                     
+0/-51   
lint-toolchain.yml
Remove toolchain linting workflow                                               
+0/-17   
spelling.yml
Remove spell check workflow                                                           
+0/-17   
test.yml
Remove main test suite workflow                                                   
+0/-131 
Enhancement
5 files
container-image.yml
Add Singularity image building workflow                                   
+63/-0   
Singularity.cpu
Add CPU container definition                                                         
+24/-0   
Singularity.cpu_bench
Add CPU benchmark container definition                                     
+27/-0   
Singularity.gpu
Add GPU container definition                                                         
+34/-0   
Singularity.gpu_bench
Add GPU benchmark container definition                                     
+32/-0   

    @Malmahrouqi3 (Collaborator, Author) commented Jul 21, 2025

    Status Update: I overslept on this. The concept itself works as intended. The persistent hurdle has been running out of disk space for the GPU images, as the NVIDIA HPC base container is roughly 5-8 GB. Clearing the cache on either side (GH runner or OSPool) does not help. I tried different base containers and recipe instructions, but without success.

    GPU Base Container: nvcr.io/nvidia/nvhpc:23.11-devel-cuda12.3-ubuntu22.04 is pinned to an older release because one of the CUDA tools was deprecated after CUDA 12.3, while ubuntu22.04 is the most recent OS release with a recent Python version. nvhpc 23.11 is capable of compiling MFC with no compromises.

    New Approach

    Build: move the build process to self-hosted Phoenix.
    Test: transfer images with scp. Condor batch job files are already configured. CPU images would run on different CPU models and GPU images on different GPUs.
    Migrate: transfer images to larger storage on OSDF, allowing the images to be publicly accessible.
    Enhance: add a flag for local builds, e.g. ./mfc.sh build --<Build Instructions> --image <Singularity File> <Image Name>. Recipes then become simpler, as they just copy the entire MFC directory with its custom build instructions and enclose it into a container image (see the sketch below).
    Employ: use the NVIDIA HPC-Benchmarks base container to acquire benchmarking results; I will experiment with it for a while.
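
    A rough sketch of such a simplified recipe (the base image and section contents are assumptions about the proposed design, not the current Singularity.* files):

    # Hypothetical simplified recipe: copy the whole MFC directory and build it inside the image.
    Bootstrap: docker
    From: ubuntu:22.04

    %files
        . /opt/MFC

    %post
        # (install compiler/MPI/Python dependencies here)
        cd /opt/MFC && ./mfc.sh build

    It would then be built with apptainer build <Image Name> <Singularity File>, matching the proposed flag's arguments.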

    @sbryngelson (Member) commented Jul 21, 2025

    These folks build all of nvhpc + openmpi + cuda into a Docker image using a standard GH runner, so far as I can tell: https://github.com/link89/github-action-demo/blob/cp2k-with-deepmd/cp2k/2025.1-cuda124-openmpi-avx512-psmp/build.sh. Can we just try this simpler approach first? It seems some of the issues here come from attempting to get it done in one shot. Perhaps try something easy first, then add complexity.

    For example, building a simple gnu+mpi docker container that has MFC in it. We could even use this as an example for new users so they can get up and running without worrying about dependencies on their system.

    @Malmahrouqi3 (Collaborator, Author) replied:

    Ohh, interesting. I will try this approach and see if it can compile and run MFC on a GH runner. Converting between Docker and Singularity is not something to worry about anyway.
    A gnu+mpi Docker container should be easy to make, honestly. I will add a container image recipe for that.
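
    A minimal sketch of what that recipe might look like (the base image and package list are guesses; MFC's actual dependency set may differ):

    # Hypothetical gnu+mpi Dockerfile with MFC pre-built; packages are illustrative, not a tested list.
    FROM ubuntu:22.04
    RUN apt-get update && apt-get install -y --no-install-recommends \
            build-essential gfortran cmake git python3 python3-pip python3-venv \
            openmpi-bin libopenmpi-dev && \
        rm -rf /var/lib/apt/lists/*
    COPY . /opt/MFC
    WORKDIR /opt/MFC
    RUN ./mfc.sh build
    CMD ["/bin/bash"]

    Converting the result to a .sif afterwards with apptainer keeps the OSPool side unchanged.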

    @Malmahrouqi3 (Collaborator, Author) commented Jul 22, 2025

    Sorta figured it out with --tmpdir /tmp/mfc_tmp.
    I updated the PR description with instructions on how to retrieve the container images. Afterwards, you can use them to run examples/tests instantly.

    Edit (Note to Self):
    Stats to display

    # For every HTCondor log, print the last slot-description block and the last
    # 'Partitionable Resources' usage table (awk keeps only the final match of each range).
    for f in *.log; do
      echo "File: $f"
      awk '/SlotName:/, /Memory =/ {block = block $0 ORS} /Memory =/ {last = block; block=""} END {print last}' "$f"
      awk '/Partitionable Resources/,/TimeSlotBusy \(s\)/ {block = block $0 ORS} /TimeSlotBusy \(s\)/ {last = block; block=""} END {if(last) print last}' "$f"
    done
    

    e.g.

    File: mfc_cpu_bench_12644463_0.log
            SlotName: slot1_1@[email protected]
            CondorScratchDir = "/scratch/local/osguserg/305559/glide_UuCd5P/execute/dir_2134154"
            Cpus = 12
            Disk = 8394329
            GLIDEIN_ResourceName = "Utah-Granite-CE1"
            GPUs = 0
            Memory = 8192
    
            Partitionable Resources :    Usage  Request Allocated 
               Cpus                 :                12        12 
               Disk (KB)            :        1  8388608   8389632 
               GPUs                 :                           0 
               Memory (MB)          :              8192      8192 
               TimeSlotBusy (s)     :       53                    
    

    The log files are used to trigger a failure if any *.out/*.err file contains the keyword 'Error'.
    After each run, clear all corresponding .log/.err/.out files.
    If all jobs passed, migrate the images to OSDF to update the MFC container images.
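
    A small sketch of that check and cleanup (it only implements the keyword match and file removal described above; it is not the configured CI step):

    # Fail the run if any stdout/stderr file mentions 'Error', then clean up job artifacts.
    if grep -l "Error" *.out *.err 2>/dev/null; then
      echo "At least one job reported an error" >&2
      exit 1
    fi
    rm -f *.log *.out *.err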

    Corresponding log numbers can be referenced with

    # Extract the four distinct cluster IDs (in version-sorted order) from the log file names.
    read CPU CPU_BENCH GPU GPU_BENCH < <(ls -1v *.log | sed -E 's/.*_([0-9]+)_[0-9]+\.log/\1/' | awk '!seen[$0]++' | head -n4 | xargs)
    echo "Logs CPU: $CPU, CPU_BENCH: $CPU_BENCH, GPU: $GPU, GPU_BENCH: $GPU_BENCH"
    

    Wait for each job instance using the following

    for cluster in $CPU $CPU_BENCH $GPU $GPU_BENCH; do
      for log in *${cluster}_*.log; do
        echo "Waiting for $log"
        condor_wait "$log"
      done
    done
    

    Successfully merging this pull request may close these issues.

    [Container] Instructions on how to use on a node that doesn't have access to the internet