The exchange probability gap is too large when using GPU for HREX #1177
Hi. This is known. My understanding is that energy calculation on the GPU is not reproducible, I guess because of differences in the way energy terms are added. On a large system, even a tiny relative difference could translate into a different acceptance. Empirically, I never identified problems due to this. Formally, I am not aware of any justification. A couple of handwaving considerations. First, the acceptances obtained by the Metropolis formula are sufficient to sample the correct distribution, but not necessary. For instance, multiplying all the acceptances by a constant factor that does not depend on the coordinates would still sample the correct distribution, just with fewer exchanges.
Strictly speaking, you should check that the ratio between the "GPU acceptances" and the "CPU acceptances" (=1.0) is independent of the coordinates of the system. I don't know how to do that, or if it is even possible. Second, I suspect that any such "GPU errors" will also be present when you integrate the equations of motion. So, my feeling is that even if you introduce some small errors in the exchange procedure, they will anyway be negligible. I am not sure how convincing these arguments are.
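As an illustration of the "sufficient but not necessary" point (an editorial addition, not part of the original comment): if the swap acceptance between replicas $i$ and $j$, which share the same temperature in HREX, is scaled by a constant $c$ that does not depend on the coordinates,

$$
a\big((x_i,x_j)\to(x_j,x_i)\big) \;=\; c\,\min\!\Big\{1,\; e^{-\beta\,[\,U_i(x_j)+U_j(x_i)-U_i(x_i)-U_j(x_j)\,]}\Big\},\qquad 0<c\le 1,
$$

then detailed balance with respect to the product distribution $\propto e^{-\beta[U_i(x_i)+U_j(x_j)]}$ still holds, because $c$ cancels in the ratio of forward and backward acceptances; exchanges simply become rarer. The concern with GPU round-off is precisely whether the effective factor is really coordinate-independent.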
Thank you very much for your response. However, I'm sorry, I didn't quite understand. Isn't the exchange probability supposed to reach a certain level, for example between 30% and 40%, for the HREX to be considered successful?
Sorry, I just noticed that these average acceptances are computed after 200 attempts. So I would not expect them to be so different from each other. In addition, all replicas are identical except possibly for the initial coordinates, right? Can you please report:
Then, ideally, could you plot the histogram of the acceptance for each pair of replicas? It would be useful to know whether problems are present at all attempts or whether there is a bimodal distribution. Finally, for each pair, the time series of the acceptance could also be useful (if it's not too messy). Thanks!!
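For readers who want to produce such plots (an editorial sketch, not part of the thread): at every exchange attempt GROMACS writes a "Repl pr" line to md.log with the Metropolis probabilities of the attempted pairs. A minimal way to pool and plot them could be the following; attributing each value to a specific replica pair is left out, since attempts alternate between even and odd pairs, and the file name md.log is an assumption.

```python
import re
import matplotlib.pyplot as plt

# Collect the Metropolis probabilities printed on "Repl pr" lines of md.log.
# NOTE: this pools all pairs together; attributing each value to a specific
# replica pair would require tracking the even/odd alternation of attempts.
probs, mean_per_attempt = [], []
with open("md.log") as f:                      # assumed log file name
    for line in f:
        if line.startswith("Repl pr"):
            vals = [float(x) for x in re.findall(r"\d*\.\d+|\d+", line[7:])]
            if vals:
                probs.extend(vals)
                mean_per_attempt.append(sum(vals) / len(vals))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3.5))
ax1.hist(probs, bins=50, range=(0, 1))
ax1.set(xlabel="acceptance probability", ylabel="count", title="all attempted pairs")
ax2.plot(mean_per_attempt)
ax2.set(xlabel="exchange attempt", ylabel="mean acceptance", ylim=(0, 1.05))
fig.tight_layout()
fig.savefig("acceptance.png")
```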
Thank you very much for your response. However, I'm sorry, I couldn't understand your point. I ran another REST2 simulation with 12 replicas from 310K to 510K, and the exchange probability was also low:
Additionally, below is an image I plotted showing the replica traversal, where the y-axis indicates which position each replica is in, and at the end is the md.mdp file.
OK, let me explain more clearly. In your last post, where the energies are different, acceptances could be low. That's life. What's strange is that they are low even when all energies are identical (identical scaling factors). Could you please go back to the simulation with strange results (low acceptance even though all scaling factors are identical) and post the md.log file here? It would be ideal if you could also repeat it without the barostat. I've seen from the md.txt file that you are using Parrinello-Rahman. Thanks! Giovanni
Thank you very much for your detailed explanation. Both simulations I mentioned earlier used almost the same md.mdp file as the one above. Following your advice, I ran an initial simulation without a barostat. Here are the results:
And md.mdp
I hope I have understood you correctly. I look forward to your reply.
Correct. Can you also share the md0.log file in some way? If it's too big, you can just make a smaller file with only the lines beginning with "Repl " (e.g. with a quick filter; see the sketch below). ADDED: And please confirm that all the tpr files are identical (except for the coordinates). I mean: this is the case where the acceptances should all be equal to 1.0, but they are not because you are using the GPU to compute forces. Thanks!
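The example filter was not captured above; a minimal way to produce such a reduced file (an editorial sketch; the input and output file names are assumptions) is:

```python
# Keep only the replica-exchange lines (those starting with "Repl ")
# so that the log is small enough to share.
with open("md0.log") as src, open("md0_repl.log", "w") as dst:
    for line in src:
        if line.startswith("Repl "):
            dst.write(line)
```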
Thank you very much for your reply. Is this the file you needed?
Hi, I am joining the conversation because we have recently identified something related. We observed that the percentage of exchange between pairs of replicas is bimodal, and that going from one value to the other happens at restarts. To illustrate this, we have performed simulations of a small protein: we ran HREX with 8 replicas, all using the same topology and starting from the same equilibrated structure. Exchanges should thus be 100%, whereas this is not the case, as you can see in the image, where it is either 75% or 100%. The vertical dashed lines mark when we performed a restart (simulations were 25 ns long here), and the slopes are there because we are doing a running average. Some additional details:
We can't understand what's going on. Since the change of exchange rate occurs only at restarts, it seems that there is a problem at the beginning of some of the simulations, and that it is not only related to machine precision on GPUs. We don't know whether it is due to the reading of the checkpoint files, their previous writing, a random number generator that would be different, the domain decomposition, or something else. Of course, the final question is whether this matters, and how to check that. I understand the argument that "it will only slow down the convergence", but we found that to converge REST2 simulations we sometimes had to wait more than 2 μs, so if we could speed that up it would be a huge benefit to us. What are your feelings regarding this? We can of course share more details and/or files. Thanks, Nicolas. PS: if someone wants to check whether his/her simulation is impacted, you can use the python script below with "python Plot.py -f MD.log -o File.png" (this script was made by Paul Guenon, who is a PhD student here at ENS).
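The attached Plot.py is not reproduced here. A rough sketch of a script with the same command-line interface (an editorial addition, not Paul Guenon's original) would parse the "Repl ex" lines of the log, where an "x" between two replica indices marks a successful exchange, and plot a running average of the exchange fraction for each neighbouring pair. Note that GROMACS alternates between even and odd pairs, so the absolute level plotted here is roughly half the true per-attempt rate; the relative comparison between pairs and over time is what matters.

```python
import argparse
from collections import defaultdict

import matplotlib.pyplot as plt


def parse_repl_ex(logfile):
    """Return, for each replica pair (i, i+1), a list of 0/1 exchange outcomes.

    A "Repl ex" line looks like "Repl ex  0 x  1    2    3 x  4": an "x"
    between two indices means that this pair exchanged at this attempt.
    Every round is counted for every pair, even though each pair is only
    attempted in roughly every other round (see the note above).
    """
    outcomes = defaultdict(list)
    with open(logfile) as f:
        for line in f:
            if not line.startswith("Repl ex"):
                continue
            tokens = line.split()[2:]            # drop "Repl" and "ex"
            exchanged = set()
            for k, tok in enumerate(tokens):
                if tok == "x":
                    exchanged.add(int(tokens[k - 1]))   # pair labelled by its lower index
            nrep = sum(1 for tok in tokens if tok != "x")
            for i in range(nrep - 1):
                outcomes[i].append(1 if i in exchanged else 0)
    return outcomes


def running_average(x, window=200):
    out = []
    for k in range(len(x)):
        lo = max(0, k - window + 1)
        out.append(sum(x[lo:k + 1]) / (k + 1 - lo))
    return out


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("-f", dest="log", default="MD.log")
    ap.add_argument("-o", dest="out", default="File.png")
    args = ap.parse_args()

    outcomes = parse_repl_ex(args.log)
    fig, ax = plt.subplots()
    for i in sorted(outcomes):
        ax.plot(running_average(outcomes[i]), label=f"{i}-{i + 1}")
    ax.set(xlabel="exchange attempt", ylabel="running exchange fraction")
    ax.legend(fontsize="small", ncol=2)
    fig.savefig(args.out, dpi=150)


if __name__ == "__main__":
    main()
```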
Thanks a lot, this is exactly what I wanted to double check. A few comments:
This is not really relevant. GROMACS does not compute the extra term in the acceptance (i.e. it assumes the tpr files to be identical). I mean: if you use identical potential energy functions, it gives the correct result because the missing extra contribution is zero anyway. If you were instead to use different potential energy functions, the acceptance would still be 1, incorrectly.
I suspected this was the problem, but apparently it is not. In the log file you should find some information about GROMACS adjusting the non-bonded cutoff in order to load-balance GPU and CPU. Can you check whether different replicas end up using different settings after this initial part? Specifically, I expect that they could be different in neighboring replicas with a low acceptance rate, and identical in neighboring replicas with a high acceptance rate. It might even be worth doing something like:
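(The original example was not captured in this thread; a sketch that compares, across replica directories, the line GROMACS prints once PME tuning has finished could look like the following. The directory pattern rest*/md.log is an assumption, and the exact wording of the log line may vary between GROMACS versions.)

```python
import glob
import re

# Report the tuned Coulomb cutoff for each replica.  GROMACS prints a line
# containing "optimal pme grid ..., coulomb cutoff ..." when tuning finishes;
# keep the last occurrence, since tuning is redone at each restart.
for log in sorted(glob.glob("rest*/md.log")):
    cutoff = None
    with open(log, errors="ignore") as f:
        for line in f:
            if "optimal pme grid" in line:
                m = re.search(r"coulomb cutoff\s+([\d.]+)", line)
                if m:
                    cutoff = m.group(1)
    print(log, "coulomb cutoff:", cutoff)
```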
to search for these differences.
Good shot! That was indeed the problem. Paul Guénon wrote a script that looks up, in each .log file, the value of the "coulomb cutoff" found for the optimal PME grid (GROMACS does that at each restart) and compares, for each pair of replicas, whether the two values are the same. The results (for the same simulation as my image above) are shown below. When the value is around 0 it means that the same cutoff was used, and when it is around 1 it means that different cutoffs were used. For the sake of readability, the values for each pair are not exactly 0 and 1. There is a perfect correlation between the two behaviours. I will try to run the same system with "mdrun -notunepme" to validate this. Thanks a lot for finding the problem so quickly!
Amazing. @wu222222 can you confirm that this is also the reason for your difference? If this is the problem, then a few comments follow below. The effect of this optimization is that the different replicas have slightly different definitions of the potential energy function.
Good point. Actually, in a replica exchange simulation the overall simulation runs at the speed of the slowest replica. So, having different PME settings on different nodes is always a bad idea. Perhaps this should be reported to the GROMACS people, @mabraham?
PME tuning is on by default when there's a decent chance of effectively re-balancing the computational workload. That's always the case when running on GPUs (though maybe most impactful when the reciprocal-space part of PME runs on the CPUs) but only for CPU-only runs when enough ranks are used per replica (IIRC >= 24). This is the origin of the apparent difference between the GPU and CPU implementations in GROMACS: the default for PME tuning differs.
I think not. It depends on how the HREX charge scaling is implemented. From the above observations it looks like the potential energy is different when the Coulomb cutoff is different. The simplest way that could happen would be if the charges were scaled only for the direct-space component, and not for the reciprocal-space component. Then, when rcoulomb is different because of PME tuning, the exchange attempt is between Hamiltonians that are different in an unintended way. If so, then I think the HREX implementation is technically wrong, though whether the impact of e.g. the missing charge scaling in reciprocal space is measurable when rcoulomb is consistent over replica space remains to be seen. Can you point me at the code that implements the charge scaling, please?
Yes and no. If the runtime environment and performance characteristics of each replica are in fact equivalent, then the tuning will converge to the same point on each replica, and tuning gives the chance of a performance improvement. In practice, it depends on the extent to which replicas use shared resources, including e.g. PCIe connections between the host and multiple GPUs in a node. Contention means that replicas fall out of sync with each other, and then they experience different runtime environments, leading to different conclusions about their respective PME tuning optima. If the performance characteristics are different (e.g. when pressure coupling is on, volumes fluctuate, leading to performance variation, which is worse when pressure is the control variable that differs between replicas), then there are cases where PME tuning can be expected to improve overall progress because it makes it possible to tune differently on different replicas. In some cases, you'd only get the benefits if PME tuning could decide to turn itself back on (e.g. after successful exchanges), but that is not implemented.
Thanks @mabraham for the very quick feedback!
I actually think that it is correct; maybe I mis-explained. Technically, we are saying that two replicas have slightly different Hamiltonians: one of them with a coarse grid and a large cutoff, the other one with a fine grid and a small cutoff. The two systems are different, and that's why we get an acceptance below 1. But the difference should be negligible if the real-space cutoff and the grid spacing are consistent with each other. I guess there's some parameter in the PME settings that sets this relationship with some tolerance. As far as I understand how PME works, one could push this tolerance further to obtain exactly the same result independently of the grid spacing. Notice that a difference of a fraction of kBT would visibly affect the acceptance in a -hrex simulation.
If you mean the charge scaling in the -hrex simulation, this is done a priori on the topology (top file). So, in the present test, we are talking about identical topologies. I think this should not affect the result. In fact:
Here we are talking about two replicas which have identical input files and topologies, but one of them decided to use a coarser PME grid and the other a finer PME grid. And we are measuring the energy difference.
I see. I agree that it's probably best to leave the choice to the user. Thanks again!
Added reference to plumed/plumed2#1177
I see that the GROMACS documentation is deficient here. I made https://gitlab.com/gromacs/gromacs/-/issues/5278 so we remember to fix it. The implementation of the tuning is straightforward at a high level: rcoulomb and the Fourier grid spacing are scaled by the same amount, because that scales the potential energy (and thus force) contributions from both in the same way, so it should be equivalent model physics. That necessarily changes the number of discrete grid points used for a simulation cell of a given volume; I don't remember whether that feeds back into a further tweak of rcoulomb. There are secondary effects on the implementation from changing rcoulomb, such as changing short-range buffer sizes and halo-exchange volumes. In any case, "equivalence" depends on the use case, and this use case appears sensitive to the differences, though I do not understand why. The quality of the PME approximation when interpolating particles to grid points could be different for different particle configurations (e.g. highly charged particles closer to grid points or not). It seems advisable to disable PME tuning with HREX, at any rate.
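For readers less familiar with PME, a rough reminder of why the two quantities are scaled together (an editorial note, not part of the comment above): the Coulomb interaction is split as

$$
\frac{1}{r} \;=\; \underbrace{\frac{\operatorname{erfc}(\beta r)}{r}}_{\text{direct space, } r<r_c} \;+\; \underbrace{\frac{\operatorname{erf}(\beta r)}{r}}_{\text{reciprocal space, FFT grid}}
$$

where the splitting parameter $\beta$ is chosen (via ewald-rtol) so that the direct-space term has decayed to the requested tolerance at the cutoff $r_c$. Increasing $r_c$ therefore lowers $\beta$, the reciprocal-space term becomes smoother, and a proportionally coarser grid gives comparable accuracy, shifting work from the FFT grid to the pair interactions.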
OK, then my earlier guess was wrong: the HREX implementation is correctly scaling the charges and the two computations in PME are consistent in all cases. |
@mabraham I was about to post a message on the GROMACS forum to advise turning off PME tuning with HREX, but I don't need to do that anymore! Regarding -dlb, it is still not clear to me what your official recommendation is. It is advised here https://www.plumed.org/doc-v2.9/user-doc/html/hrex.html to turn it off; however, I never did so and my simulations never crashed. So what are your thoughts on that?
I see no reason why dynamic load balancing itself should be relevant. Keeping all the domains a fixed size is not intrinsically safer for HREX. Adjusting the domain volumes to try to balance the force-computation time doesn't change the quality of the approximation in either part of the PME algorithm. I see that one issue that probably led to that recommendation derived from experience with a simulation using a coarse-grained force field. The halo-exchange volume is based on the pair-list radius (including the buffer), which in turn is estimated analytically from the particle density and expected diffusion (see the GROMACS documentation); those assumptions might be relatively easy to violate with a coarse-grained simulation. If so, and if one had a simulation that seemed to crash with DLB on, decreasing https://manual.gromacs.org/current/user-guide/mdp-options.html#mdp-verlet-buffer-tolerance is an option that will lead GROMACS to be more "conservative" here, which might improve the symptoms. Of course, the problem could be something else related to some dramatic rearrangement of the simulation particles.
My understanding is that if you change the domain decomposition you will change the order in which forces are accumulated. Since GROMACS does this in floating-point arithmetic, the result might be slightly different. I think there are two levels of accuracy needed:
Yes, the identity of the particles in each domain changes each time they are repartitioned, because they diffuse. That changes the order of force accumulation on CPU. On a GPU, the force accumulation is non-reproducible by design, even when using a single domain. Changing the number of particles in each domain (i.e. with DLB on) is merely an additional source of difference.
... and perhaps also to use a single domain and not use a GPU. But even then the resulting force computation depends on the particle order in the input file. If this is a use case that would actually warrant GROMACS implementing force accumulation in fixed precision to get reproducibility, that would be interesting to hear about.
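To make the accumulation-order point concrete (an editorial sketch, not from the thread): summing the same single-precision contributions in a different order generally gives a slightly different result, whereas accumulating them in fixed precision (integers) is exactly order-independent.

```python
import numpy as np

# Fake per-particle force contributions, in single precision as on a GPU.
contribs = np.float32(np.random.default_rng(0).normal(size=100_000))

shuffled = contribs.copy()
np.random.default_rng(1).shuffle(shuffled)   # same values, different order

# Floating point: the sum depends on the accumulation order.
s1 = np.float32(0.0)
for c in contribs:
    s1 += c
s2 = np.float32(0.0)
for c in shuffled:
    s2 += c
print("float32 sums differ by", float(s1) - float(s2))

# Fixed point: round each contribution to an integer number of "force units".
# Integer addition is associative, so any accumulation order is bit-identical.
SCALE = 2 ** 20
i1 = sum(int(round(float(c) * SCALE)) for c in contribs)
i2 = sum(int(round(float(c) * SCALE)) for c in shuffled)
print("fixed-point sums identical:", i1 == i2)
```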
True, indeed this is what we traditionally suggested. Still, it's useful to test with the GPU to make sure that, when one uses it in production, there will not be gross errors. I think this thread, with all the posted tests, will be a good reference for this.
I don't think this is a use case where this is necessary; it's mostly a test. Still, it might be nice to have complete reproducibility, to be able to reproduce and investigate rare events such as a system explosion or similar (e.g., starting from a checkpoint). And my understanding is that integer arithmetic is also faster (is this true?). I don't know if this is on the GROMACS roadmap.
Mmmm, but those are relatively rare and mostly impact developers, so spending a lot of developer time re-working things that work is opportunity lost to do things that directly impact users.
Depends on the instruction mix and how coupled future instructions are to the ones currently executing. An integer operation is simpler. But here we aren't going to use the result of the force accumulation until the next force accumulation, so the difference is probably negligible.
Not until there's a user-land use case!
Dear plumed users:
This is my configuration:
When performing Hamiltonian Replica Exchange (HREX), I set the scaling for all replicas to be 1.0, so theoretically, the exchange probabilities should all be 1.0.
Indeed, when using the CPU, the exchange probabilities are all 1.0, but when using GPU acceleration, the exchange probabilities vary significantly:
Replica exchange statistics
Is this normal? I recall that the limited computational precision on GPUs can lead to significant errors, but would such large discrepancies in the exchange probabilities affect the final results? If not, why?
The command I used is as follows:
nohup mpirun --use-hwthread-cpus -np 12 gmx_mpi mdrun -v -deffnm rest -nb gpu -pin on -ntomp 1 -replex 1000 -hrex -multidir rest0 rest1 rest2 rest3 rest4 rest5 rest6 rest7 rest8 rest9 rest10 rest11 -dlb no -plumed plumed.dat > nohup.out 2>&1 &
I am looking forward to your replies with great anticipation.