FDS running too slow on Linux clusters #13186
Describe the bug
I recently installed FDS (the latest version) on my university's Linux clusters. We have about 4 servers with 132 CPUs each. I tried running a simulation on one server using 32 CPUs (ntasks). The simulation is taking too long; in fact, it is even slower than on my personal laptop. I am new to using FDS and would appreciate it if someone could help. The job control script that I am using is attached.

Expected behavior
Was expecting FDS to run at a much faster pace.

Screenshots
Here's how the output file looks:
script2.txt

Comments
We also need to see your input file. The output you provided shows that you are using only 1 MPI process. Do you have only 1 mesh?
I think there are 3 meshes.
FDS can only map an MPI process to a single mesh (this is how we do domain decomposition). So your --ntasks and --ntasks-per-node need to be 3. But something else is not right, because you are seeing "Number of MPI Processes: 1". I don't quite understand this. Fix the number to 3, try again, and let us know.
Try this (see the sketch below): Tasks here refer to MPI processes, one per mesh. Note that your meshes have significantly different sizes, which is bad for load balance: your calculation will be as slow as the slowest worker, here the process managing mesh 02.
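For illustration, a minimal 3-task SLURM submission might look like the following. The job name, input file name, and install path are assumptions, not the script originally attached to this comment:

```bash
#!/bin/bash
#SBATCH --job-name=fds_job     # hypothetical job name
#SBATCH --nodes=1
#SBATCH --ntasks=3             # one MPI process per mesh (3 meshes)
#SBATCH --ntasks-per-node=3

# Set up the FDS environment (path assumed from later in this thread)
source ~/FDS/FDS6/bin/FDS6VARS.sh

# Launch exactly one MPI process per mesh
mpiexec -n 3 fds job.fds
```

The key point is that the MPI process count matches the mesh count; requesting more MPI processes than meshes does not make FDS faster.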
You may not be able to use the pre-compiled FDS binaries on your cluster. srun may be linked to a specific mpiexec/mpirun executable that was built with a different compiler than the pre-compiled binaries. When you run FDS with the wrong mpiexec, it can sometimes spin up N serial copies of the same input rather than one parallel job. In addition to the comments by Randy and Marcos, I would try building the source code yourself with the compiler environment available on your cluster.
I changed the number to 3 and reran the simulation. It seems the speed hasn't improved much. Here's the new file. It still shows Number of MPI Processes = 1.
I changed the batch file and the speed hasn't improved much. Would using 3 nodes, or a number of tasks that is a multiple of 3, help? I wanted to take advantage of the large number of CPUs we have in a server. Is there a way I could do that here? Thanks.
The mpiexec or mpirun you are using is from a different compiler than the binary FDS executable. That's why you see three repeats of the initialization info. See my previous message about compiling FDS with your cluster's build environment.
What would building the source code entail? I'm not very familiar with it.
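In broad strokes, it means cloning the FDS repository and running the build script that matches the cluster's compiler and MPI stack. A rough sketch, assuming the standard firemodels/fds layout (the Build subdirectory name is an example and varies by FDS version and toolchain):

```bash
# Clone the FDS source
git clone https://github.com/firemodels/fds.git

# Pick the build target that matches your compiler/MPI (example name)
cd fds/Build/impi_intel_linux

# Build an fds executable linked against the cluster's own MPI
./make_fds.sh
```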
@johodges Shouldn't things work correctly if the user adds the environment variables to their .bashrc, as suggested at the end of the FDS-SMV install process? @shivam11021 Have you installed FDS yourself, or did you have your sys admin do it?
I did it myself, but with help from the IT people.
Are you loading modules, or are you running the FDS6VARS.sh script?
I just go to the terminal and use the command "bash script.sh". I don't load any modules explicitly.
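For reference, the installer's suggestion amounts to sourcing the FDS environment script from ~/.bashrc so that every shell, including batch jobs, resolves the bundled mpiexec. A one-line sketch, with the path assumed from the default install location reported later in this thread:

```bash
# In ~/.bashrc (path is an assumption)
source ~/FDS/FDS6/bin/FDS6VARS.sh
```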
I have run into this issue before when I tried to use a pre-compiled version of FDS on Frontera, which also uses srun. If you look at the submission file, the user is not calling mpiexec/mpirun directly, which means srun is deciding which MPI version to call.
Sorry for the double post. I just checked again on Frontera: if I source FDS6VARS.sh in my .bashrc file, I am able to run the pre-compiled version, and it uses the Intel mpiexec under the hood. I was misremembering which cluster I ran into this issue on; it was probably Polaris, since they use MPICH. @shivam11021 can you run an interactive job on your cluster and check which MPI is being pulled in? You can type "which mpiexec" and it will tell you which file it is. You can also run "mpiexec --version" to see which compiler it was built with.
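A sketch of that check, assuming srun can start an interactive shell on your cluster (the partition name is hypothetical):

```bash
# Start an interactive shell on a compute node (partition name is hypothetical)
srun --partition=compute --ntasks=1 --pty bash

# On the compute node:
which mpiexec        # which mpiexec your PATH resolves to
mpiexec --version    # which MPI library/compiler it was built with
```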
This is what I got:
~/FDS/FDS6/bin/INTEL/bin/mpiexec
Intel(R) MPI Library for Linux* OS, Version 2021.6 Build 20220227 (id: 28877f3f32)
Did you try adding a source command for FDS6VARS? Try submitting this one:
#!/bin/bash
…
If that does not work, try submitting this job:
#!/bin/bash
…
Then post the mpiver.txt file that is generated.
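Both scripts were cut off after their shebang lines above. Based on the surrounding discussion, they plausibly looked something like the following; everything beyond #!/bin/bash is an assumption, not the original attachment:

```bash
#!/bin/bash
#SBATCH --ntasks=3
# Assumed: source the FDS environment so the bundled Intel MPI is used
source ~/FDS/FDS6/bin/FDS6VARS.sh
mpiexec -n 3 fds job.fds
```

```bash
#!/bin/bash
#SBATCH --ntasks=1
# Assumed: record which MPI the batch environment actually picks up
source ~/FDS/FDS6/bin/FDS6VARS.sh
which mpiexec > mpiver.txt
mpiexec --version >> mpiver.txt
```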
@johodges I tried running the first script, but it didn't improve the simulation speed. The second one didn't run FDS, nor did it create the mpiver.txt file.
When you did this check, were you on one of the compute nodes when you ran the which command? If so, start another interactive job and then run your case manually on the compute node using the same source commands, as sketched below. Let us know if you still see the multiple repeats of each part of the initialization.
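A sketch of that manual run; the allocation syntax varies by cluster, and the input file name is hypothetical:

```bash
# Request an interactive allocation of 3 tasks on one node
salloc --nodes=1 --ntasks=3

# In the resulting shell, use the same source command as the batch script
source ~/FDS/FDS6/bin/FDS6VARS.sh
mpiexec -n 3 fds job.fds
```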
I am testing FDS parallel computing on a Linux machine, and I find the running speed is no faster than on my laptop. Any ideas?
FDS speed is a function of the hardware (number and type of CPUs, amount and type of memory, bus, etc.) and the configuration of the machine (how many users, cluster or standalone, what other software is running, etc.). It is certainly not the case that every Linux machine is faster than any Windows machine.