LSDMap fails for >64K configurations #3
Comments
Hi Vivek, it seems to be an internal MPI-related error. I remember we had an MPI issue some time ago on Stampede, or was it on Archer? For some reason, it ran fine with a smaller number of CPUs. Could you try running your command with more or fewer CPUs (e.g. 16, 32 and 128) to see if you still get the same issue?
Going from 32K to 64K configurations requires a >4-fold increase in memory. I recall seeing segmentation faults when it attempts to use too much memory to construct the LSDMap. Try running with more CPUs to see whether that resolves it.
I ran lsdmap for a file with 64K configurations with 32 cores and ran into the same error. With 128 cores on the 64K configurations, it reports the following error: https://gist.github.com/vivek-bala/da7423cd35a2ea80ad00. In this case, however, it also produced the .eg, .ev and nearest-neighbor files, and the eigenvector file has 64000 lines.
It was successful with no errors when I used 256 cores, but it took close to an hour to complete. Is that expected?
Hi Vivek, how large a protein are you using? How many atoms are you computing RMSD for? How long did the 32K configurations take? LSDMap scales as O(N^2) in both time and memory, where N is the number of frames. Most of the time and memory is spent computing the distance and kernel matrices. Lorenzo and Cecilia are working on speeding it up, but that is not ready yet.
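To make the O(N^2) memory point concrete, here is a rough back-of-the-envelope sketch. It assumes a dense N x N matrix of 8-byte floats and takes "32K" and "64K" to mean 32000 and 64000 frames; whether LSDMap actually stores the matrix this way is an assumption, and the helper name `dense_matrix_bytes` is made up for illustration.

```python
def dense_matrix_bytes(n_frames, bytes_per_entry=8):
    """Bytes needed for one dense n_frames x n_frames matrix.

    Assumption: 8-byte (float64) entries and no sparsity or symmetry savings.
    """
    return n_frames * n_frames * bytes_per_entry


if __name__ == "__main__":
    for n in (32_000, 64_000):
        gib = dense_matrix_bytes(n) / 2**30
        print(f"{n} frames: {gib:.1f} GiB for one dense matrix")
```

Under these assumptions, doubling the number of frames quadruples the footprint of each such matrix. The total is spread across all MPI ranks, which may be why adding cores helps.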
I am using 1-alanine amino acid. It has 22 atoms (https://raw.githubusercontent.com/radical-cybertools/radical.ensemblemd/master/usecases/extasy_gromacs_lsdmap/inp_files/input.gro). For 32K configurations on 64 cores, the time taken was 1215 seconds. I doubled the number of configurations and increased the resources by 4 times (to account for the O(N^2) behaviour). The time taken seems to have increased 3-fold.
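The expectation behind that comparison can be written out as a small sketch. It assumes ideal scaling, i.e. O(N^2) work divided evenly across cores with no communication or I/O overhead, which real runs will not achieve; the function name `ideal_time` is hypothetical.

```python
def ideal_time(t_ref, n_ref, cores_ref, n_new, cores_new):
    """Ideal wall time, assuming O(N^2) work split evenly across cores.

    Assumption: perfect parallel efficiency, no serial or I/O steps.
    """
    work_ratio = (n_new / n_ref) ** 2
    core_ratio = cores_new / cores_ref
    return t_ref * work_ratio / core_ratio


# Reference run from the thread: 32K configurations, 64 cores, 1215 s.
expected = ideal_time(1215.0, 32_000, 64, 64_000, 256)
print(f"ideal time for 64K configurations on 256 cores: {expected:.0f} s")
```

With these numbers the ideal time is unchanged from the reference run, so a 3-fold increase (about 3645 s, roughly the hour reported above) points to a step that does not scale with the core count.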
Hi Vivek, That doesn't sound right actually. I'll double check though on our cluster here at Rice and make sure. The processes not scaling with the number of processors is most baffling to me.
Vivek, could you check the lsdmap.log file to see how much time each step is taking? The O(N^2) behaviour only applies to the computation of the distance matrix; the 3-fold difference may be because other steps contribute to the overall time.
For 64K configs:
For 32K configs:
Apparently, most of the time is spent between the last two lines of the log file. At that point we are only saving the distance matrix and/or nearest neighbors, and only if the -n or -d flags are specified. Did you specify either of these options? If not, I have run into this strange problem before, when running LSDMap tests on Archer: nothing was being done between the last two statements of the log file, yet that stage still took a long time. I concluded that the logging module was somehow not working well with many CPUs. If you are not using either -n or -d, try commenting out the last lines of lsdm.py, which print the last two statements of the log file, to see whether it makes any difference.
I use '-n' to name the neighbour file. I thought that was required. I'll try it with the changes you suggested.
Could you tell me the exact lines to comment out, please?
I think the '-n' flag is only needed when running DM-d-MD, because the nearest neighbors must be saved in order to reweight correctly after selecting the new walkers. If you only want to test LSDMap, this option is not mandatory. The lines to comment out are 412 to 423 (inclusive) in https://github.com/jp43/lsdmap/blob/master/lsdmap/lsdm.py. If you are using DM-d-MD, comment out only lines 412 and 423. In that case, if the time between the last two statements of the log file is still long, it would mean that the function "save_nneighbors" takes most of the time.
I am using the LSDMap installed on Stampede.
I ran lsdmap for a file with 64K configurations.
Script: https://gist.github.com/vivek-bala/954b24a694b52d79350e
It failed with the following error: https://gist.github.com/vivek-bala/312e87d00e1e5273e79d
It succeeds for <=32K configurations.
Am I making a mistake somewhere? Maybe the installed LSDMap is several versions old; the code seems to have been updated in Jan of this year.