
The exchange probability gap is too large when using GPU for HREX #1177

Closed
wu222222 opened this issue Jan 17, 2025 · 25 comments

Comments

@wu222222

Dear plumed users:

This is my configuration:

GROMACS version:    2020.7-plumed-2.9.2
Precision:          single
Memory model:       64 bit
MPI library:        MPI
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support:        CUDA
SIMD instructions:  AVX2_256
FFT library:        fftw-3.3.8-sse2-avx-avx2-avx2_128
RDTSCP usage:       enabled
TNG support:        enabled
Hwloc support:      hwloc-2.5.0
Tracing support:    disabled
C compiler:         /usr/bin/cc GNU 11.4.0
C compiler flags:   -mavx2 -mfma -fexcess-precision=fast -funroll-all-loops -O3 -DNDEBUG
C++ compiler:       /usr/bin/c++ GNU 11.4.0
C++ compiler flags: -mavx2 -mfma -fexcess-precision=fast -funroll-all-loops SHELL:-fopenmp -O3 -DNDEBUG
CUDA compiler:      /usr/local/cuda-12.2/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver; Copyright (c) 2005-2023 NVIDIA Corporation; Built on Tue_Jun_13_19:16:58_PDT_2023; Cuda compilation tools, release 12.2, V12.2.91; Build cuda_12.2.r12.2/compiler.32965470_0
CUDA compiler flags: -gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_75,code=compute_75;-use_fast_math;-D_FORCE_INLINES;-mavx2 -mfma -fexcess-precision=fast -funroll-all-loops SHELL:-fopenmp -O3 -DNDEBUG
CUDA driver:        12.20
CUDA runtime:       12.20

When performing Hamiltonian replica exchange (HREX), I set the scaling factor for all replicas to 1.0, so in theory the exchange probabilities should all be 1.0.
Indeed, when using the CPU the exchange probabilities are all 1.0, but with GPU acceleration they vary significantly:
Replica exchange statistics

Repl  399 attempts, 200 odd, 199 even
Repl  average probabilities:
Repl     0    1    2    3    4    5    6    7    8    9   10   11
Repl      .92  .92  .18  .25  .92  .92  .21  .22  .92  .91  .92
Repl  number of exchanges:
Repl     0    1    2    3    4    5    6    7    8    9   10   11
Repl      185  177   39   56  182  187   43   48  176  182  189
Repl  average number of exchanges:
Repl     0    1    2    3    4    5    6    7    8    9   10   11
Repl      .93  .89  **.19**  .28  .91  .94  .22  .24  .88  .91  .94

Is this normal? I recall that limited computational precision on GPUs can lead to significant errors, but would such large discrepancies in exchange probabilities affect the final results? If not, why?

The command I used is as follows:
nohup mpirun --use-hwthread-cpus -np 12 gmx_mpi mdrun -v -deffnm rest -nb gpu -pin on -ntomp 1 -replex 1000 -hrex -multidir rest0 rest1 rest2 rest3 rest4 rest5 rest6 rest7 rest8 rest9 rest10 rest11 -dlb no -plumed plumed.dat > nohup.out 2>&1 &

I am looking forward to your replies with great anticipation.

@GiovanniBussi
Member

GiovanniBussi commented Jan 17, 2025

Hi. This is known. My understanding is that energy calculation on the GPU is not reproducible, I guess because of differences in the way energy terms are added. On a large system, even a tiny relative difference could translate to a different acceptance.

Empirically, I never identified problems due to this.

Formally, I am not aware of any justification. A couple of handwaving considerations:

First, the acceptances obtained by the Metropolis formula are sufficient to sample the correct distribution, but not necessary. For instance:

  • if you take the acceptances and multiply all of them by 0.5, you will still get the correct distribution. It will just take longer to converge (see the sketch after this list)
  • if you take all the acceptances and multiply all of them by a different random number between 0 and 1, again, you will get the right result.
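
To illustrate the first point, here is a minimal sketch (plain Python, not PLUMED or GROMACS code): a 1D Metropolis sampler targeting a standard Gaussian, run once with the usual acceptance and once with the acceptance uniformly multiplied by 0.5. Both chains converge to the same distribution; the scaled one just mixes more slowly.

import numpy as np

def metropolis(n_steps, scale=1.0, step=1.0, seed=0):
    # Metropolis sampler for a standard Gaussian; 'scale' uniformly multiplies
    # the acceptance probability (scale=1.0 is plain Metropolis).
    rng = np.random.default_rng(seed)
    x, samples = 0.0, []
    for _ in range(n_steps):
        x_new = x + rng.normal(0.0, step)                      # symmetric proposal
        acc = min(1.0, np.exp(0.5 * (x * x - x_new * x_new)))  # Metropolis ratio for N(0,1)
        if rng.random() < scale * acc:
            x = x_new
        samples.append(x)
    return np.array(samples)

for scale in (1.0, 0.5):
    s = metropolis(200_000, scale=scale)
    print(scale, round(s.mean(), 3), round(s.std(), 3))  # both chains give mean ~0, std ~1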

Strictly speaking, you should check that the ratio between the "GPU acceptances" and the "CPU acceptances" (=1.0) is independent of the coordinates of the system. I don't know how to do that, or whether it is possible.

Second, I suspect that any such "GPU errors" will also be present when you integrate the equations of motion. So, my feeling is that even if you introduce some small errors in the exchange procedure, the effect will be negligible anyway.

I am not sure how convincing these arguments are.

@wu222222
Author

Thank you very much for your response. However, I'm sorry, I didn't quite understand. Isn't the exchange probability supposed to reach a certain level, for example between 30% and 40%, for the HREX to be considered successful?
Or are you saying that the exchange probability cannot determine the success or quality of the HREX?


@GiovanniBussi
Member

Sorry I just noted that these average acceptances are computed after 200 attempts. So I would not expect them to be so different from each other.

In addition, all replicas are identical except possibly for the initial coordinates, right?

Can you please report:

  • whether you are using a barostat or running at constant volume
  • in the latter case, whether all replica volumes are identical (they should be)

Then, ideally, could you plot the histogram of acceptance for each pair of replicas? It would be useful to know whether problems are present at all attempts or whether there are bimodal distributions. Finally, for each pair, the time series of the acceptance could also be useful (if it's not too messy).

Thanks!!

@wu222222
Author

wu222222 commented Jan 18, 2025

Thank you very much for your response. However, I'm sorry, I couldn't understand your point. I ran another REST2 simulation with 12 replicas from 310 K to 510 K, and the exchange probabilities were also low:

Replica Exchange Statistics
Repl  399 attempts, 200 odd, 199 even
Repl  average probabilities:
Repl     0    1    2    3    4    5    6    7    8    9   10   11
Repl      .13  .12  .10  .14  .12  .12  .19  .31  .14  .13  .10
Repl  number of exchanges:
Repl     0    1    2    3    4    5    6    7    8    9   10   11
Repl       26   26   19   26   28   24   38   63   26   25   22
Repl  average number of exchanges:
Repl     0    1    2    3    4    5    6    7    8    9   10   11
Repl      .13  .13  .09  .13  .14  .12  .19  .32  .13  .13  .11

Additionally, the md.mdp file is attached at the end.

Below are images I plotted showing the replica traversal, where the y-axis indicates which position each replica occupies.
I hope this is helpful to you.

[Images: replica traversal plots for the REST2 run]

And here is md.mdp:

md.txt

@GiovanniBussi
Member

OK let me explain more clearly.

In your last post, where the energies are different, acceptances could be low. That's life. What's strange is that they are low even when all energies are identical (identical scaling factors).

Could you please go back to the simulation with strange results (low acceptance even if all scaling factors are identical) and post here the md.log file?

It would be ideal if you could also repeat it without barostat. I've seen from the md.txt file that you are using Parrinello Rahman.

Thanks!

Giovanni

@wu222222
Author

Thank you very much for your detailed explanation.

Both simulations I mentioned earlier used almost the same md.mdp file as the one above. Following your advice, I ran an initial simulation without a barostat.

Here are the results:

Replica exchange statistics
Repl  399 attempts, 200 odd, 199 even
Repl  average probabilities:
Repl     0    1    2    3    4    5    6    7    8    9   10   11
Repl      .93  .25  .22  .91  .91  .91  .92  .92  .92  .92  .91
Repl  number of exchanges:
Repl     0    1    2    3    4    5    6    7    8    9   10   11
Repl      183   47   40  177  184  184  189  174  186  181  187
Repl  average number of exchanges:
Repl     0    1    2    3    4    5    6    7    8    9   10   11
Repl      .92  .24  .20  .89  .92  .92  .94  .87  .93  .91  .94

And md.mdp

title                   = OPLS Lysozyme NVT equilibration
; Run parameters
integrator              = md        ; leap-frog integrator
nsteps                  = 400000   ; 0.002 * 400000 = 800 ps (0.8 ns)
dt                      = 0.002     ; 2 fs

; Output control
nstxout                 = 0         ; suppress bulky .trr file by specifying
nstvout                 = 0         ; 0 for output frequency of nstxout,
nstfout                 = 0         ; nstvout, and nstfout
nstenergy               = 5000       ; save energies every 10.0 ps
nstlog                  = 5000      ; update log file every 10.0 ps
nstxout-compressed      = 5000      ; save compressed coordinates every 10.0 ps
compressed-x-grps       = System    ; save the whole system

; Bond parameters
continuation            = yes       ; Restarting after NPT
constraint_algorithm    = lincs     ; holonomic constraints
constraints             = h-bonds   ; bonds involving H are constrained
lincs_iter              = 1         ; accuracy of LINCS
lincs_order             = 4         ; also related to accuracy

; Neighborsearching
cutoff-scheme           = Verlet    ; Buffered neighbor searching
ns_type                 = grid      ; search neighboring grid cells
nstlist                 = 10        ; 20 fs, largely irrelevant with Verlet scheme
rcoulomb                = 1.0       ; short-range electrostatic cutoff (in nm)
rvdw                    = 1.0       ; short-range van der Waals cutoff (in nm)

; Electrostatics
coulombtype             = PME       ; Particle Mesh Ewald for long-range electrostatics
pme_order               = 4         ; cubic interpolation
fourierspacing          = 0.16      ; grid spacing for FFT

; Temperature coupling is on
tcoupl                  = V-rescale ; modified Berendsen thermostat
tc-grps                 = Protein Non-Protein ; two coupling groups - more accurate
tau_t                   = 0.1 0.1   ; time constant, in ps
ref_t                   = 310 310   ; reference temperature, one for each group, in K

; Pressure coupling is off
;pcoupl                  = Parrinello-Rahman     ; Pressure coupling on in NPT
;pcoupltype              = isotropic             ; uniform scaling of box vectors
;tau_p                   = 2.0                   ; time constant, in ps
;ref_p                   = 1.0                   ; reference pressure, in bar
;compressibility         = 4.5e-5                ; isothermal compressibility of water, bar^-1

; Periodic boundary conditions
pbc                     = xyz       ; 3-D PBC

; Dispersion correction
DispCorr                = EnerPres  ; account for cut-off vdW scheme

; Velocity generation
gen_vel                 = no        ; Velocity generation is off

I hope I have understood you correctly. I look forward to your reply.

@GiovanniBussi
Member

GiovanniBussi commented Jan 23, 2025

Correct. Can you share in some way also the md0.log file?

If it's too big, you can just make a smaller file with only the lines beginning with "Repl ", e.g. grep "^Repl " md0.log >short.log

ADDED: And please confirm that all the tpr files are identical (except for coordinates). I mean: this is the case where acceptances should be all equal to 1.0 but they are not because you are using GPU to compute forces. Thanks!

@wu222222
Author

Thank you very much for your reply. Is this the file you needed?

short.log

@ncheron

ncheron commented Jan 23, 2025

Hi,

I am joining the conversation because we recently identified something related. We observed that the percentage of exchange between pairs of replicas is bimodal, and switching from one value to the other happens at restarts. To illustrate this, we performed simulations of a small protein: we ran HREX with 8 replicas, all using the same topology and starting from the same equilibrated structure. Exchanges should thus be 100%, whereas that is not the case, as you can see in the image, where the rate is either 75% or 100%. The vertical dashed lines mark when we performed a restart (simulations were 25 ns long here), and the slopes appear because we are using a running average.

[Image: running-average exchange rates per replica pair; vertical dashed lines mark restarts]

Some additional details :

  • With the same system running on CPU, the exchanges stay around 100% so the problem comes from the GPUs.
  • We performed replica exchange with Gromacs (not patched with plumed) with temperatures being set from 300.0 to 300.7K: the exchanges were around 99%, so no problem here.
  • We tried the GPU-resident mode (i.e. all calculations on GPU) as well as some calculations on the CPUs with "-update cpu": the problem occurs in both cases.
  • The problem is not related to "dynamic load balance" being set to yes or no.
  • The problem occurs on GPUs from NVIDIA (V100 and RTX4090) and AMD (Mi250).
  • The problem occurs on various versions of Gromacs patched with plumed (2019.4, 2022.5, 2023.3).
  • The image above was made with NVT to check the influence of the barostat, and the same problem occurs with NPT.

We can't understand what's going on. Since the change in exchange rate occurs only at restarts, it seems that there is a problem at the beginning of some of the simulations, and that it is not only related to machine precision on GPUs. We don't know whether it is due to the reading of the checkpoint files, their previous writing, a difference in the random number generator, the domain decomposition, or something else.

Of course, the final question is whether this matters, and how to check that. I understand the argument that "it will only slow down the convergence", but we found that converging REST2 simulations sometimes required more than 2 µs, so speeding this up would be a huge benefit to us.

What are your feelings regarding this? We can of course share more details and/or files.

Thanks,

Nicolas

PS: if someone wants to check whether their simulation is impacted, you can use the Python script below with "python Plot.py -f MD.log -o File.png" (this script was made by Paul Guenon, who is a PhD student here at ENS).

import numpy as np
import argparse
import matplotlib.pyplot as plt

"""
This script reads a log file containing replica exchange data, calculates running averages of exchange rates, and plots the results over time.

Usage:
    python plot_exchange_rates_with_time.py -f <logfile> [-o <outputfile>] [-w <window>]

Arguments:
    -f, --file: Name of the log file (required)
    -o, --output: Name of the output PNG file (default: "exchange_rates_along_time.png")
    -w, --window: Window size in percentage for running average (default: 2)

Example:
    python plot_exchange_rates_with_time.py -f exchange_log.txt -o output.png -w 5
"""

def plot_progress_bar(current, total, bar_length=40):
    progress = current / total
    block = int(round(bar_length * progress))
    text = f"\rProgress: [{'#' * block + '-' * (bar_length - block)}] {progress * 100:.2f}%"
    print(text, end='')

def open_file (namefile) :
    file_opened=open(namefile,'r')
    lines_file=file_opened.readlines()
    data=[]
    count=0
    nreplica=0
    dt_tries=0
    times_restart=[]
    restart=False
    print("Loading file : "+namefile)
    for row in lines_file :
        plot_progress_bar(count, len(lines_file), bar_length=40)

        if "Repl ex" in row :
            data.append([x for x in row.split()])
            if nreplica==0 : 
                nreplica=int(row.split()[-1])+1

        if dt_tries==0 and "Replica exchange at step" in row :
            dt_tries=float(row.split()[-1])
        
        if "Restarting from checkpoint" in row : 
            restart=True
        if restart==True and "Replica exchange at step" in row : 
            times_restart.append(float(row.split()[-1]))
            restart=False
        
        count+=1
    plot_progress_bar(count, len(lines_file), bar_length=40)
    file_opened.close()
    return data, nreplica, dt_tries, times_restart

def running_average (y_data, window_size) :
    kernel = np.ones(window_size) / window_size
    y_avg = np.convolve(y_data, kernel, mode='valid')
    return y_avg

def main () :
    
    data,nreplica,dt_tries,times_restart=open_file(namefile)
    ntries=len(data)

    window_size=int(ntries*window_percentage/100)
    times_avg=np.arange((window_size-1)//2,(ntries+1)//2-(window_size)//2,1)*(dt_tries*2)/1000    

    color_choice=plt.cm.jet(np.linspace(0,1,nreplica-1))

    Exchange_array=np.zeros((nreplica-1,(ntries+1)//2))
    count_try=0
    print('\n')
    print("Processing data")
    for i in range (len(data)) :
        plot_progress_bar(i, len(data), bar_length=40)
        if len(data[i])>2 :
            if data[i][0]=='Repl' and data[i][1]=='ex' :
                for j in range(2,len(data[i])):
                    if data[i][j]=='x' :
                        Exchange_array[int(data[i][j-1]),int(count_try/2)]=1
                count_try+=1    
    plot_progress_bar(len(data), len(data), bar_length=40)
    print('\n')
    print("Running averages and plotting")

    ax=plt.subplots()[1]
    for c in range (nreplica-1) : 
        plot_progress_bar(c, nreplica-1, bar_length=40)
        ax.plot(times_avg,running_average(Exchange_array[c,:],window_size), color=color_choice[c],label=str(c)+'->'+str(c+1),lw=1)
    plot_progress_bar(nreplica-1, nreplica-1, bar_length=40)
    for time_restart in times_restart :
        ax.axvline(x=time_restart/1000, color='k', linestyle  = '--', lw=0.5)
    print('\n')
    box = ax.get_position()
    ax.set_position([box.x0, box.y0, box.width * 0.9, box.height])
    ax.legend(loc='center left', bbox_to_anchor=(1, 0.5),fontsize=4)
    ax.set_xlabel('Time in ns')
    ax.set_ylabel('Percentage of successful exchanges')
    plt.title('Exchange rates from '+namefile,fontsize=8)
    plt.savefig(outputfile,dpi=300)
    plt.close()

if __name__ == "__main__":   
    parser = argparse.ArgumentParser(
        description="This script reads a log file containing replica exchange data, calculates running averages of exchange rates, and plots the results over time.",
        epilog="Example:\n    python plot_exchange_rates_with_time.py -f exchange_log.txt -o output.png -w 5",
        formatter_class=argparse.RawTextHelpFormatter
    )
    
    parser.add_argument('-f', '--file', required=True, help="Name of the log file (required)")
    parser.add_argument('-o', '--output', default="exchange_rates_along_time.png", help="Name of the output PNG file (default: 'exchange_rates_along_time.png')")
    parser.add_argument('-w', '--window', type=float, default=2, help="Window size in percentage of the trajectory for running average (default: 2)")

    args = parser.parse_args()
    namefile = args.file
    outputfile = args.output
    window_percentage = args.window
    main ()

@GiovanniBussi
Member

Thanks a lot, this is exactly what I wanted to double check. A few comments:

We performed replica exchange with Gromacs (not patched with plumed) with temperatures being set from 300.0 to 300.7K: the exchanges were around 99%, so no problem here.

This is not really relevant. GROMACS does not compute the extra term in the acceptance (i.e. it assumes the tpr files are identical). I mean: if you use identical potential energy functions, it gives the correct result because it does not compute the extra contribution. If you instead use different potential energy functions, the acceptance will still be 1, incorrectly.

The problem is not related to "dynamic load balance" being set to yes or no.

I suspected this was the problem, but apparently it is not.

In the log file you should find some information about GROMACS adjusting the cutoff for non-bonded interactions in order to load balance GPU and CPU. Can you check whether different replicas end up using different settings after this initial part? Specifically, I expect that they could be different for neighboring replicas with a low acceptance rate, and identical for neighboring replicas with a high acceptance rate.

It might even be worth doing something like:

diff md0.log md1.log

to search for these differences

@ncheron

ncheron commented Jan 23, 2025

Good shot! That was indeed the problem. Paul Guénon wrote a script that looks up, in each .log file, the value of "coulomb cutoff" found for the optimal PME grid (GROMACS does this at each restart), and compares, for each pair of replicas, whether the two values are the same. The result (for the same simulation as my image above) is shown below. When the value is around 0 it means that the same cutoff was used, and when it is around 1 it means that different cutoffs were used. For the sake of readability, values for each pair are not exactly 0 and 1. There is a perfect correlation between the two behaviours. I will try to run the same system with "mdrun -notunepme" to validate this.

[Image: per-pair indicator of identical vs different tuned Coulomb cutoffs, correlating with the exchange-rate behaviour]
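
For reference, a minimal sketch of this kind of check (not Paul Guénon's script), assuming each replica's md.log reports the tuned value on a line containing both "optimal pme grid" and "coulomb cutoff"; the exact wording and the rest*/md.log directory pattern are assumptions that may need adjusting:

import glob
import re

cutoffs = {}
for log in sorted(glob.glob("rest*/md.log")):       # adjust to your directory layout
    values = []
    with open(log) as fh:
        for line in fh:
            if "optimal pme grid" in line and "coulomb cutoff" in line:
                m = re.search(r"coulomb cutoff\s+([0-9.]+)", line)
                if m:
                    values.append(float(m.group(1)))  # one value per tuning pass / restart
    cutoffs[log] = values

for log, values in cutoffs.items():
    print(log, values)
print("all replicas used the same cutoffs:",
      len({tuple(v) for v in cutoffs.values()}) == 1)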

Thanks a lot for finding the problem so quickly!

@GiovanniBussi
Member

Amazing. @wu222222 can you confirm that this is also the reason for your difference?

If this is the problem, then some comments below.

The effect of this optimization is that the different replicas have slightly different definitions of the potential energy function.
The reported acceptance is then correct, considering these energy functions.

  • Is this producing incorrect results? No. Or better: the inconsistency between replicas is the same inconsistency that you find if you run the simulation twice and in the two cases the load balancing ends up with a different PME grid. So, we can probably assume that they are equivalent. Otherwise, there's something wrong in the way GROMACS splits long and short range electrostatics.
  • Is this producing suboptimal round trips? To some extent, yes. This would be important if the "physical" difference between Hamiltonians is small (e.g., a single residue), because in this case a numerical discrepancy in the PME for the entire water box might contribute even more than the "physical" difference. The worst possible case is "unscaled Hamiltonians", which is the test we are looking at now, where this is the only contribution. If instead you are changing a larger part of your system (e.g. REST2, with the whole solute) I guess that the contribution of the PME discrepancy should not dominate. Anyway, even in the REST2 simulation it might be useful to check if replicas are stuck at the boundaries between pieces of the ladder with different PME settings.

@ncheron

ncheron commented Jan 24, 2025

Thanks for your thoughts on this. I confirm that with "-notunepme" on the mdrun line, the problem disappears. The simulation even runs slightly faster with -notunepme than without (440 vs 430 ns/day).

[Image: running-average exchange rates with -notunepme, staying high for all pairs]

@GiovanniBussi
Member

even going slightly faster with -notunepme than without

Good point. Actually, in a replica exchange simulation the overall simulation runs at the speed of the slowest replica. So, having different PME settings on different nodes is always a bad idea. Perhaps this should be reported to the GROMACS people, @mabraham?

@mabraham
Contributor

mabraham commented Jan 24, 2025

PME tuning is on by default when there's a decent chance of effectively re-balancing the computational workload. That's always the case when running on GPUs (though maybe most impactful when the reciprocal-space part of PME runs on the CPUs) but only for CPU-only runs when enough ranks are used per replica (IIRC >= 24). This is the origin of the apparent difference between the GPU and CPU implementations in GROMACS: the default for PME tuning differs.

The effect of this optimization is that the different replicas have slightly different definitions of the potential energy function. The reported acceptance is then correct, considering these energy functions.

  • Is this producing incorrect results? No. Or better: the inconsistency between replicas is the same inconsistency that you find if you run the simulation twice and in the two cases the load balancing ends up with a different PME grid. So, we can probably assume that they are equivalent. Otherwise, there's something wrong in the way GROMACS splits long and short range electrostatics.

I think not. It depends on how the HREX charge scaling is implemented. From the above observations it looks like the potential energy is different when the Coulomb cutoff is different. The simplest way that could happen would be if the charges are scaled only for the direct-space component, and not for the reciprocal-space component. Then, when rcoulomb differs because of PME tuning, the exchange attempt is between Hamiltonians that are different in an unintended way. If so, then I think the HREX implementation is technically wrong, though whether the impact of e.g. the missing charge scaling in reciprocal space is measurable when rcoulomb is consistent over replica space remains to be seen. Can you point me at the code that implements the charge scaling, please?

@mabraham
Contributor

Actually, in a replica exchange simulation the overall simulation runs at the speed of the slowest. So, having different PME settings in different nodes is always a bad idea.

Yes and no. If the runtime environment and performance characteristics of each replica are in fact equivalent, then the tuning will converge to the same point on each replica, and tuning gives the chance of a performance improvement. In practice, it depends on the extent to which replicas use shared resources, including e.g. PCIe connections between the host and multiple GPUs in a node. Contention means that replicas will fall out of sync with each other, and then they experience different runtime environments, leading to different conclusions about their respective PME tuning optima. If the performance characteristics are different (e.g. when pressure coupling is on, volumes fluctuate, leading to performance variation, which is worse when pressure is the control variable that differs between replicas), then there are cases where PME tuning can be expected to improve overall progress because it makes it possible to tune differently on different replicas. In some cases, you'd only get the benefits if PME tuning could decide to turn itself back on (e.g. after successful exchanges), but that is not implemented.

@GiovanniBussi
Member

Thanks @mabraham for the very quick feedback!

If so, then I think the HREX implementation is technically wrong, though whether the impact of e.g. the missing charge scaling in reciprocal space is measurable when rcoulomb is consistent over replica space remains to be seen.

I actually think that it is correct; maybe I explained it poorly. Technically, we are saying that two replicas have slightly different Hamiltonians: one of them with a coarse grid and a large cutoff, the other with a fine grid and a small cutoff. The two systems are different, and that's why we get an acceptance < 1. But the difference should be negligible if the real-space cutoff and the grid spacing are consistent with each other. I guess there's some parameter in the PME settings that sets the relationship with some tolerance. As far as I understand how PME works, one could push this tolerance further to get exactly the same result independently of the grid spacing. Notice that a difference of a fraction of kBT would visibly affect the acceptance in a -hrex simulation.
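
For what it's worth, a back-of-envelope sketch of that tolerance (my reading of the GROMACS convention, so treat it as an assumption): the mdp option ewald-rtol sets, approximately, erfc(beta * rcoulomb), and the Ewald splitting parameter beta is chosen to satisfy it; tuning rcoulomb at a fixed ewald-rtol therefore moves beta, with the grid spacing rescaled to keep the reciprocal-space error comparable.

import math

def ewald_beta(rcoulomb, rtol=1e-5):
    # Solve erfc(beta * rcoulomb) = rtol for beta by bisection
    # (erfc is monotonically decreasing in beta).
    lo, hi = 0.0, 20.0 / rcoulomb
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if math.erfc(mid * rcoulomb) > rtol:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Replicas tuned to different cutoffs end up with different splitting
# parameters (and correspondingly different grids) for the same tolerance.
for rc in (1.0, 1.1, 1.2):
    print(rc, round(ewald_beta(rc), 4))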

Can you point me at the code that implements the charge scaling, please?

If you mean the charge scaling in the -hrex simulation, this is done a priori on the topology (top file). So, in the present test, we are talking about identical topologies. I think this should not affect the result. In fact:

  • atoms are sent to the neighboring replica (which has different settings)
  • energy is recalculated
  • the two values of the energy are compared and enter the acceptance (a sketch of this step follows after this list).
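
A sketch of the acceptance step being described (the textbook Hamiltonian replica exchange formula, not the actual PLUMED/GROMACS code); the energies and the inverse temperature beta = 0.4 mol/kJ (roughly 300 K) are made-up numbers for illustration:

import math

def hrex_acceptance(u_i_xi, u_i_xj, u_j_xj, u_j_xi, beta):
    # Probability of swapping configurations x_i and x_j between replicas
    # i and j with potentials U_i and U_j at the same inverse temperature.
    delta = beta * ((u_i_xj - u_i_xi) + (u_j_xi - u_j_xj))
    return min(1.0, math.exp(-delta))

# Identical Hamiltonians and exact energies: the cross terms cancel, acceptance = 1.
print(hrex_acceptance(-1.0e6, -1.0e6, -1.0e6, -1.0e6, beta=0.4))
# A few kJ/mol of mismatch (e.g. a different tuned PME grid on one replica)
# already pushes the acceptance well below 1 on a ~10^6 kJ/mol energy scale.
print(hrex_acceptance(-1.0e6, -1.0e6 + 2.0, -1.0e6, -1.0e6 + 2.0, beta=0.4))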

Here we are talking about two replicas which have identical input files and topologies, but one of them decided to use a coarser PME grid and the other a finer PME grid. And we are measuring the energy difference.

Yes and no. If the runtime environment and performance characteristics of each replica is in fact equivalent, then the tuning will converge to the same point on each replica and tuning gives the chance for a performance improvement.

I see. I agree that probably it's best to let the choice to the user.

Thanks again!

GiovanniBussi added a commit to plumed/masterclass-22-10 that referenced this issue Jan 24, 2025
@mabraham
Contributor

I see that the GROMACS documentation is deficient here. I made https://gitlab.com/gromacs/gromacs/-/issues/5278 so we remember to fix it.

The implementation of the tuning is straightforward at a high level: rcoulomb and the Fourier grid spacing are scaled by the same amount, because that scales the potential energy (and thus force) contributions from both in the same way, so it should be an equivalent model physics. That necessarily changes the number of discrete grid points used for a simulation cell of a given volume; I don't remember whether that feeds back into a further tweak of rcoulomb. There are secondary effects on the implementation from changing rcoulomb, such as changing short-range buffer sizes and halo-exchange volumes.

In any case "equivalence" depends on the use case, and this use case appears sensitive to the differences, though I do not understand why. The quality of the PME approximation when interpolating particles to grid points could be different for different particle configurations (e.g. highly charged particles closer to grid points or not).

It seems advisable to disable PME tuning with HREX, at any rate.

@mabraham
Contributor

Here we are talking about two replicas which have identical input files and topologies, but one of them decided to use a coarser PME grid and the other a finer PME grid. And we are measuring the energy difference.

OK, then my earlier guess was wrong: the HREX implementation is correctly scaling the charges and the two computations in PME are consistent in all cases.

@ncheron

ncheron commented Jan 24, 2025

@mabraham I was about to post a message on the GROMACS forum to advise turning off PME tuning with HREX, but I don't need to do that anymore!

Regarding -dlb, it is still not clear to me what are your official recommendations. It is advised here https://www.plumed.org/doc-v2.9/user-doc/html/hrex.html to turn it off, however I never did so and my simulations never crashed. So what are your thoughts on that?

@mabraham
Contributor

Regarding -dlb, it is still not clear to me what are your official recommendations. It is advised here https://www.plumed.org/doc-v2.9/user-doc/html/hrex.html to turn it off, however I never did so and my simulations never crashed. So what are your thoughts on that?

I see no reason why dynamic load balancing itself should be relevant. Keeping all the domains a fixed size is not intrinsically safer for HREX. Adjusting the domain volumes to try to balance the force-computation time doesn't change the quality of the approximation in either part of the PME algorithm.

It seems that the issue that probably led to that recommendation derived from experience with a simulation using a coarse-grained force field. The halo-exchange volume is based on the pair-list radius (including the buffer), which in turn is estimated analytically from the particle density and expected diffusion (see the GROMACS documentation), and which might be relatively easy to violate in a coarse-grained simulation. If so, and if one had a simulation that seemed to crash with DLB on, decreasing https://manual.gromacs.org/current/user-guide/mdp-options.html#mdp-verlet-buffer-tolerance is an option that will lead GROMACS to be more "conservative" here, which might improve the symptoms. Of course, the problem could be something else related to some dramatic rearrangement of the simulation particles.

@GiovanniBussi
Member

see no reason why dynamic load balancing itself should be relevant

My understanding is that if you change the domain decomposition you will change the order in which forces are accumulated. Since GROMACS does this in floating-point arithmetic, the result might be slightly different.
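
As a minimal illustration of that point (plain NumPy, nothing GROMACS-specific): summing the same single-precision numbers in a different order typically gives a slightly different total.

import numpy as np

rng = np.random.default_rng(1)
forces = rng.normal(0.0, 1.0e3, size=1_000_000).astype(np.float32)

total_a = np.sum(forces)                   # one accumulation order
total_b = np.sum(rng.permutation(forces))  # same numbers, shuffled order
print(total_a, total_b, float(total_a) - float(total_b))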

I think there are two levels of accuracy needed:

  • If the user wants to test that acceptance is 1.0 with identical topology, it should be exactly 1.0. 0.99 would be an error. I suspect that to obtain this one has to disable all these optimizations
  • If the user does not want to add additional bottlenecks in replica diffusion, any setting that would result in an acceptance close to 1 in the "identical topology" settings would be ok. The examples above suggest that at least -notunepme is necessary in this second case.

@mabraham
Contributor

My understanding is that if you change the domain decomposition you will change the order in which forces are accumulated

Yes, the identity of the particles in each domain changes each time they are repartitioned, because they diffuse. That changes the order of force accumulation on CPU. On a GPU, the force accumulation is non-reproducible by design, even when using a single domain. Changing the number of particles in each domain (i.e. with DLB on) is merely an additional source of difference.

If the user wants to test that acceptance is 1.0 with identical topology, it should be exactly 1.0. 0.99 would be an error. I suspect that to obtain this one has to disable all these optimizations

... and perhaps also to use a single domain and not use a GPU. But even then the resulting force computation depends on the particle order in the input file. If this is a use case that would actually warrant GROMACS implementing force accumulation in fixed precision to get reproducibility, that would be interesting to hear about.

@GiovanniBussi
Member

and perhaps also to use a single domain and not use a GPU

True, indeed this is what we traditionally suggested. Still, it's useful to test with a GPU to make sure that, when one uses it in production, there will not be gross errors. I think this thread, with all the posted tests, will be a good reference for this.

If this is a use case that would actually warrant GROMACS implementing force accumulation in fixed precision to get reproducibility, that would be interesting to hear about.

I don't think this is a use case where this is necessary, it's mostly a test.

Still, it might be nice to have complete reproducibility, to be able to reproduce and investigate rare events such as system explosions or similar (e.g., starting from a checkpoint). And my understanding is that integer arithmetic is also faster (is this true?). I don't know if this is on the GROMACS roadmap.

@mabraham
Contributor

Still it might be nice to have complete reproducibility to be able to reproduce and investigate rare events such as system explosion or similar (e.g., starting from a check point).

Mmmm, but those are relatively rare and mostly impact developers, so spending a lot of developer time re-working things that work is opportunity lost to do things that directly impact users.

And my understanding is that integer arithmetics is also faster (is this true?).

Depends on the instruction mix and how coupled future instructions are to the ones currently executing. An integer operation is simpler. But here we aren't going to use the result of the force accumulation until the next force accumulation, so the difference is probably negligible.

I don't know if this is in the GROMACS road map.

Not until there's a user-land use case!
