LSDMap fails for >64K configurations #3

Open
vivek-bala opened this issue Jul 6, 2015 · 13 comments

@vivek-bala

I am using the LSDMap installed on Stampede.

I ran lsdmap for a file with 64K configurations.
Script: https://gist.github.com/vivek-bala/954b24a694b52d79350e
It failed with the following error: https://gist.github.com/vivek-bala/312e87d00e1e5273e79d

It runs successfully for <=32K configurations, though.

Am I making a mistake somewhere? Or maybe the installed LSDMap is several versions old; the repository seems to have been updated in January of this year.

@jp43
Owner

jp43 commented Jul 6, 2015

Hi Vivek, it seems to be an internal MPI-related error. I remember we had an MPI issue some time ago on Stampede, or was it on Archer? For some reason, it was running OK with a smaller number of CPUs. Could you try running your command with more or fewer CPUs to see if you still get the same issue (e.g. 16, 32 and 128 CPUs)?

@TensorDuck
Collaborator

Going from 32K to 64K configurations requires at least a 4-fold increase in memory, since the distance matrix grows as N^2. I recall seeing segmentation faults when it attempts to use too much memory to construct the LSDMap.

You should try running with more CPUs to see if it's okay.
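For a rough sense of scale, here is a back-of-the-envelope sketch (my own estimate, assuming the full N x N distance matrix is stored in double precision and split row-wise across MPI ranks; lsdmap's actual memory layout may differ):

def matrix_memory_gb(n_frames, n_ranks=1):
    # size of an N x N float64 matrix, divided across MPI ranks
    return n_frames ** 2 * 8.0 / n_ranks / 1024 ** 3

for n in (32000, 64000):
    print("%d frames: %.1f GB total, %.2f GB per rank on 64 cores"
          % (n, matrix_memory_gb(n), matrix_memory_gb(n, 64)))
# 32000 frames: 7.6 GB total, 0.12 GB per rank on 64 cores
# 64000 frames: 30.5 GB total, 0.48 GB per rank on 64 cores

Doubling the number of frames quadruples both numbers, which would be consistent with the segfaults only showing up at 64K.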

@vivek-bala
Author

I ran lsdmap for a file with 64K configurations on 32 cores and ran into the same error.

For 64K configurations with 128 cores, it reports the following error: https://gist.github.com/vivek-bala/da7423cd35a2ea80ad00. In this case, though, it also produced the .eg, .ev and nearest-neighbor files, and the eigenvector file has 64000 lines.

@vivek-bala
Author

It completed successfully with no errors when I used 256 cores, but it took close to an hour. Is that expected?

@TensorDuck
Collaborator

Hi Vivek,

How large of a protein are you using? How many atoms are you computing RMSD for?

How long did the 32K configurations take?

The scaling for LSDMap is O(N^2) in both time and memory, where N is the number of frames. Most of the time and memory is spent computing the distance and kernel matrices. Lorenzo and Cecilia are working on speeding it up, but that is not ready yet.
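To make the N^2 scaling concrete, here is a toy illustration of the all-pairs comparison (plain Euclidean distance standing in for the RMSD that LSDMap actually computes; this is just a sketch, not the project's implementation):

import numpy as np

def pairwise_distances(coords):
    # coords: (n_frames, n_atoms * 3) flattened coordinates
    n = len(coords)
    dist = np.zeros((n, n))          # O(N^2) memory
    for i in range(n):
        for j in range(i + 1, n):    # O(N^2) pair comparisons
            d = np.linalg.norm(coords[i] - coords[j])
            dist[i, j] = dist[j, i] = d
    return dist

# e.g. 1000 frames of a 22-atom system; doubling the frame count
# quadruples both the number of pairs and the size of the matrix
coords = np.random.rand(1000, 22 * 3)
dist = pairwise_distances(coords)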

@vivek-bala
Author

I am using a single alanine amino acid; it has 22 atoms (https://raw.githubusercontent.com/radical-cybertools/radical.ensemblemd/master/usecases/extasy_gromacs_lsdmap/inp_files/input.gro).

For 32K configurations on 64 cores, the time taken was 1215 seconds.
For 64K configurations on 256 cores, the time taken was 3637 seconds. (Below 256 cores, I am consistently running into the same error as before)

I doubled the number of configurations and increased the resources by 4 times (to account for the O(N^2) behaviour), yet the time taken increased by about 3 times.
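In other words, under ideal O(N^2) scaling I would have expected the 64K run to take roughly

1215 s x (64000/32000)^2 x (64/256) = 1215 s,

whereas it actually took 3637 s, i.e. about 3x slower than the ideal estimate.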

@TensorDuck
Collaborator

Hi Vivek,

That doesn't sound right, actually. I'll double check on our cluster here at Rice to make sure.

The fact that the runtime does not scale with the number of processors is the most baffling part to me.

@jp43
Owner

jp43 commented Jul 9, 2015

Vivek, could you check the lsdmap.log file to see how much time each step is taking? The O(N^2) behaviour only applies to the computation of the distance matrix; the 3x difference may come from the other steps that contribute to the overall time.
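Something like the following would extract per-step durations from the log (a convenience sketch only, assuming the "INFO:root:HH:MM:SS: message" format; it is not part of lsdmap):

from datetime import datetime

def step_durations(path="lsdmap.log"):
    # parse timestamped log lines and print the time spent on each step
    steps = []
    with open(path) as f:
        for line in f:
            if not line.startswith("INFO:root:"):
                continue
            stamp, _, message = line[len("INFO:root:"):].partition(": ")
            steps.append((datetime.strptime(stamp, "%H:%M:%S"), message.strip()))
    for (t0, _), (t1, msg) in zip(steps, steps[1:]):
        print("%6d s  %s" % ((t1 - t0).seconds, msg))

step_durations()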

@vivek-bala
Author

For 64K configs:

INFO:root:14:31:50: intializing LSDMap...
INFO:root:14:31:52: input coordinates loaded
INFO:root:14:32:01: LSDMap initialized
INFO:root:14:34:39: distance matrix computed
INFO:root:14:34:42: kernel diagonalized
INFO:root:14:34:50: Eigenvalues/eigenvectors saved (.eg/.ev files)
INFO:root:15:32:10: LSDMap computation done

For 32K configs:

INFO:root:17:40:16: intializing LSDMap...
INFO:root:17:40:16: input coordinates loaded
INFO:root:17:40:17: LSDMap initialized
INFO:root:17:42:56: distance matrix computed
INFO:root:17:42:58: kernel diagonalized
INFO:root:17:42:59: Eigenvalues/eigenvectors saved (.eg/.ev files)
INFO:root:17:59:18: LSDMap computation done

@jp43
Owner

jp43 commented Jul 10, 2015

Apparently, most of the time is spent between the last two lines of the log file. At that point, we are basically saving the distance matrix and/or the nearest neighbors, but only if the -d or -n flags are specified. Did you specify either of these options? If not, I already ran into this strange problem when running LSDMap tests on Archer: I was doing essentially nothing between the last two statements of the log file, yet it was still taking a lot of time. I concluded that somehow the logging module was not working well with many CPUs. If you are not using the -n or -d flags, try commenting out the lines of lsdm.py that print the last two statements of the log file and see if there is any difference.

@vivek-bala
Author

I use '-n' to name the neighbour file. I thought that was required. I'll try it with the changes you suggested.

@vivek-bala
Author

Could you tell me the exact lines to comment out, please?

@jp43
Owner

jp43 commented Jul 10, 2015

I think the '-n' flag is only needed when running DM-d-MD, because you need to save the nearest neighbors in order to reweight correctly after selecting the new walkers. If you only want to test LSDMap, this option is not mandatory. The lines to comment out are lines 412 to 423 (inclusive) in https://github.com/jp43/lsdmap/blob/master/lsdmap/lsdm.py. However, if you do use DM-d-MD, try commenting out only lines 412 and 423. In that case, if you still see a slow time between the last two statements of the log file, it would mean that the function "save_nneighbors" takes most of the time.
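If it helps, a simple way to confirm which part dominates is to wrap the suspect calls with a wall-clock timer, e.g. (a generic sketch; the exact calls and their placement in lsdm.py may of course differ):

import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # print the wall-clock time spent inside the 'with' block
    t0 = time.time()
    yield
    print("%s took %.1f s" % (label, time.time() - t0))

# around the lines discussed above, something like:
# with timed("save_nneighbors"):
#     <the nearest-neighbor save>
# with timed("final log statements"):
#     <the last two logging calls>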
