
9.0 vs 8.2.6 simulation run performance (Caliper) for ModelDB models #3283

nrnhines opened this issue Dec 10, 2024 · 18 comments

@nrnhines
Member

nrnhines commented Dec 10, 2024

NEURON versions 9.0a-467-gbaf9ea8db and 8.2.6
nrn-modeldb-ci version 6f28ea8
Python 3.11.8
cmake configuration (both NEURON versions)

cmake .. -G Ninja -DCMAKE_INSTALL_PREFIX=install -DNRN_ENABLE_PROFILING=ON -DNRN_PROFILER=caliper -DCMAKE_PREFIX_PATH=$HOME/soft/Caliper/build/install

In nrn-modeldb-ci, run with:

export CALI_CONFIG=runtime-report
runmodels --clean --workdir master --virtual

and also with a workdir of 8.2 (using the 8.2.6 NEURON version).
The runmodels commands produce master.json (7,055,004 bytes, Dec 10 14:33) and 8.2.json (6,886,403 bytes, Dec 10 14:52), which contain lines like

{
    "0": {
        "NEURON version": "8.2.6",
...
    "267691": {
...
       "nrn_run": [
...
           "Path                               Time (E) Time (I)  Time % (E) Time % (I) ",
...
            "fadvance                           0.034119 11.073676   0.306922  99.613264 ",

8.2.json contained 257 "fadvance..." lines and 2 "psolve..." lines.

Plotting the master times (y axis) and 8.2.6 times (x axis) on log and linear plots gives:
[images: log and linear plots of master (y axis) vs 8.2.6 (x axis) times]
When master time is larger than 8.2 time, the point for that model is above the red line (equal times).
The three largest outliers (master larger than 8.2) for sim time > 1 second are

[model id, 8.2.6(s),  master(s)]
['206267', 48.64375, 86.6161]     206267/GrC_FHF_ModelDB cvode.use_daspk(1)  (uses IDA)
['267666', 9.51025, 27.4894]       267666/publishable_code_4 cvode.use_local_dt(1)
['156120', 3.503465, 16.341625] 156120/HAE_LAE_Netk cvode.use_local_dt(1)
@1uc
Collaborator

1uc commented Dec 11, 2024

There are two PRs in the NMODL repo that might help:

The former "fixes" the aliasing issue with a sledge hammer (for certain cases) by adding #pragma ivdep or #pragma omp simd. The latter switches replaces the handcoded "Crout" solver with Eigen.

@nrnhines
Member Author

nrnhines commented Dec 11, 2024

The caliper information for the above three models is too incomplete for a reliable assessment. I will need to add instrumentation for the major phases of cvode calls into NEURON. Nevertheless, the existing timings suggest two initial prejudices: 1) %times seem to scale similarly between 8.2 and 9.0 (i.e. absolute time ratios are generally somewhat similar to the fadvance time ratio); 2) mechanism times are generally quite a small component of the overall fadvance time. For example, here is the full caliper info for 156120

['156120', 3.503465, 16.341625] 156120/HAE_LAE_Netk cvode.use_local_dt(1)

8.2

            "Path                   Time (E) Time (I) Time % (E) Time % (I) ",  
            "finitialize            0.008033 0.008110   0.191526   0.193362 ",  
            "  gap-v-transfer       0.000033 0.000033   0.000785   0.000785 ",  
            "  spike-exchange       0.000044 0.000044   0.001051   0.001051 ",  
            "fadvance               3.345042 3.503465  79.750723  83.527754 ",
            "  net-receive-ExpSyn   0.001325 0.001325   0.031597   0.031597 ",
            "  net-receive-Gaba_syn 0.051325 0.051325   1.223656   1.223656 ",
            "  net-receive-AMPA_S   0.105773 0.105773   2.521778   2.521778 "
            
   
master      

            "Path                   Time (E)  Time (I)  Time % (E) Time % (I) ",
            "finitialize             0.019072  0.019269   0.109482   0.110611 ",
            "  gap-v-transfer        0.000073  0.000073   0.000420   0.000420 ",
            "  spike-exchange        0.000124  0.000124   0.000709   0.000709 ",
            "fadvance               15.713475 16.341625  90.200473  93.806256 ",
            "  net-receive-ExpSyn    0.004790  0.004790   0.027497   0.027497 ",
            "  net-receive-Gaba_syn  0.172118  0.172118   0.988014   0.988014 "
            "  net-receive-AMPA_S    0.451241  0.451241   2.590272   2.590272 ",

ModelView shows that this model has (summarizing the much larger actual output)

101 sections; 101 segments
* 101 real cells (100 encapsulated in 100 distinct objects of 2 classes)
    * 80 Pyramidal   
            80 sections; 80 segments
            * 7 inserted mechanisms
                Ra = 150
                cm = 1
                * pas
                ena = 50   
                ek = -100  
                gbar_KPyr = 800
                gbar_NaPyr = 1000
            * 8160 Point Processes
                * 80 IClamp
                * 80 ExpSyn
                * 1600 Gaba_syn
                * 6400 AMPA_S
    * 20 fast_spiking
            20 sections; 20 segments
            * 7 inserted mechanisms
                Ra = 150
                cm = 1
                * pas
                ena = 50
                ek = -100
                gbar_KPyr = 800
                gbar_NaPyr = 1000
            * 2040 Point Processes
                * 20 IClamp
                * 20 ExpSyn
                * 400 Gaba_syn
                * 1600 AMPA_S
* 100 artificial cells of 1 base classes
    * 100 NetStim
* 4292 NetCon objects
    delay = 1

In addition to the local variable time step, it uses cvode.condition_order(2). By the way, local variable time step for this model means there are 100 cvode instances with independent time step integration, each managing just one cell.
cvode.statistics() prints 100 groups of cvode instance data; e.g. a couple of instances of each cell type are

Cvode instance 0x55ab50904038 cvode_init_dummy statistics : 1 CVode states
   2 advance_tn, 0 interpolate, 1 init (0 due to at_time)
   8 function evaluations, 2 mx=b solves, 0 jacobian setups

Cvode instance 0x55ab50904128 Pyramidal[0].soma statistics : 185 CVode states
   1180 advance_tn, 168 interpolate, 170 init (1 due to at_time)
   3825 function evaluations, 1797 mx=b solves, 0 jacobian setups
...
Cvode instance 0x55ab50908b38 Pyramidal[79].soma statistics : 185 CVode states
   1069 advance_tn, 162 interpolate, 164 init (1 due to at_time)
   3483 function evaluations, 1596 mx=b solves, 0 jacobian setups

Cvode instance 0x55ab50908c28 fast_spiking[0].soma statistics : 185 CVode states
   980 advance_tn, 219 interpolate, 221 init (1 due to at_time)
   3521 function evaluations, 1438 mx=b solves, 0 jacobian setups
...
Cvode instance 0x55ab50909df8 fast_spiking[19].soma statistics : 185 CVode states
   1331 advance_tn, 231 interpolate, 233 init (1 due to at_time)
   4521 function evaluations, 2027 mx=b solves, 0 jacobian setups

Unfortunately, event statistics do not seem to be gathered for the local variable step method, but they can roughly be inferred from the number of initializations and interpolations of the cvode instances. This is a single-thread serial model. The above cvode statistics do give an idea of the great deal of asynchronous and random access into each of the range variable vectors via the cvode instances. Cvode instances advance under the constraints:

all cvode t0 <= all cvode t1 (t0 is the previous step time, t1 the current step time; a cvode instance can interpolate to any time in its interval t0 <= t <= t1)
all cvode t0 <= next global event time (events are monotonic in time)

The rule is always: do the next earliest thing, i.e. handle an event or do a cvode step with the least t1.
See https://pmc.ncbi.nlm.nih.gov/articles/PMC2712447/
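To make the rule concrete, here is a minimal standalone sketch (invented types and names, not NEURON's NetCvode implementation): repeatedly pick the earliest pending thing, either the next event or a step of the cvode instance with the least t1.

// Sketch of the local-variable-step scheduling rule described above.
#include <algorithm>
#include <queue>
#include <vector>

struct LocalCvode {
    double t0 = 0.0, t1 = 0.0;                  // previous and current step time
    void advance_tn() { t0 = t1; t1 += 0.01; }  // stand-in for one CVODE step
};

struct Event { double t; int target; };
struct Later { bool operator()(const Event& a, const Event& b) const { return a.t > b.t; } };
using EventQueue = std::priority_queue<Event, std::vector<Event>, Later>;

void run(std::vector<LocalCvode>& cells, EventQueue& events, double tstop) {
    if (cells.empty()) return;
    for (;;) {
        auto least = std::min_element(cells.begin(), cells.end(),
            [](const LocalCvode& x, const LocalCvode& y) { return x.t1 < y.t1; });
        double tevent = events.empty() ? tstop : events.top().t;
        if (tevent >= tstop && least->t1 >= tstop) break;
        if (tevent <= least->t1) {
            // every instance already has t1 >= tevent, so the target can
            // interpolate back to tevent: the event is the earliest thing to do
            events.pop();  // "deliver" the event
        } else {
            least->advance_tn();  // otherwise step the instance with least t1
        }
    }
}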

@nrnhines
Member Author

I'm looking into more detail with regard to fadvance and scatter/gather performance for 8.2 and 9.0 with cvode. My guess (hope) is that for the global time step, scatter/gather is the 9.0 culprit due to DataHandle overhead. That can be easily solved with the use of raw pointers, which is possible and safe since a time step is carried out with a sorted_token. On the other hand, I fear that local variable time step performance is poor in 9.0 because per-variable arrays don't help with the actual memory access patterns of the local variable time step method. Since the local variable time step is only rarely used, in comparison with the global variable time step methods, fixing the latter will eliminate this distribution release performance blocker. Ironically, I believe a CoreNEURON-style single array for all variables with a permutation that keeps all cell variables adjacent, i.e. AoS, would allow local variable time step 9.0 performance to equal 8.2 performance.
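To picture what "use of raw pointers" means here, a sketch (the Handle type below is an invented stand-in, not the actual neuron::container::data_handle API): resolve the handles to double* once while the sorted_token keeps the storage frozen, so the inner scatter loop becomes a plain pointer store.

// Sketch only: why caching raw pointers beats per-access handle resolution
// in the cvode scatter/gather inner loops.
#include <cstddef>
#include <vector>

struct Handle {
    double* base;       // backing SoA column
    std::size_t row;    // row located indirectly in the real container
    double& get() const { return base[row]; }  // the real handle does more work per call
};

// Handle style: every element goes through the handle machinery.
void scatter_y_handles(const std::vector<Handle>& pv, const double* y) {
    for (std::size_t i = 0; i < pv.size(); ++i) {
        pv[i].get() = y[i];
    }
}

// Cached-pointer style: handles resolved to double* once per step (safe while
// a sorted_token guarantees no permutation/reallocation), then a plain store.
void scatter_y_cached(const std::vector<double*>& pv, const double* y) {
    for (std::size_t i = 0; i < pv.size(); ++i) {
        *pv[i] = y[i];
    }
}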

Anyway, what is the evidence? I'm approaching the performance measurement issue by measuring intervals (ns) from beginning to end of a given function with ZTBEGIN and ZTTOTAL(i) macros that count the number of calls, accumulate the interval times, and store the interval times in a length-256 ring buffer.

#include <time.h>

#define MAX_ZTIME 5
#define MAX_ZTIME_DETAIL 0xff
extern size_t ztime[MAX_ZTIME];                             // accumulated ns per counter
extern int ztcnt[MAX_ZTIME];                                // call count per counter
extern size_t ztime_detail[MAX_ZTIME][MAX_ZTIME_DETAIL+1];  // ring buffer of the last 256 interval times

// monotonic wall-clock time in nanoseconds
inline size_t zt_realtime() {
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    const size_t billion{1000000000};
    // const size_t floatmask{0Xfffffffffffff};  // 52 bits
    return ((billion * ts.tv_sec) + ts.tv_nsec); // & floatmask;
}

#define ZTBEGIN size_t zt = zt_realtime()
#define ZTTOTAL(i) \
    zt = zt_realtime() - zt; \
    ztime_detail[i][ztcnt[i]&MAX_ZTIME_DETAIL] = zt; \
    ztime[i] += zt; \
    ++ztcnt[i];
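Usage is just bracketing the body of a function to be measured, as in the hypothetical sketch below (the function name is made up; the actual patch instruments fadvance and the cvode scatter/gather routines and defines the ztime/ztcnt/ztime_detail arrays in one translation unit):

void some_instrumented_function() {
    ZTBEGIN;       // zt = start time in ns
    // ... the work being timed ...
    ZTTOTAL(0);    // accumulate elapsed ns into slot 0, record it in the
                   // ring buffer, and bump the call count
}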

For the ['156120', 3.503465, 16.341625] 156120/HAE_LAE_Netk cvode.use_local_dt(1) model, an example of the stdout at the end of a 9.0 run (counter index, accumulated time in seconds, and call count) is:

0 7.35675 90134 256
1 1.15987 1136180 256

The 0 index is for "fadvance" and the 1 index is used for the 4 "scatter/gather" patterns used by cvode for state and dstate.

Plotting the fadvance raw interval buffer (scaled to microseconds) vs buffer index yields:
[image: fadvance raw intervals]
Notice that in this case there is one large outlier. (I suppose the cpu was interrupted and started doing something else within the interval.) Zooming in and plotting again, as well as sorting and plotting, shows, for fadvance and scatter/gather:
[image: zoomed and sorted interval plots for fadvance and scatter/gather]
Notice in the "fadvance raw intervals" graph there are two very small intervals. This indicates that two time steps of the last 256 time steps for all the 101 cells were devoted to the single empty cell of this model. (This model has 1 empty cell, 80 Pyramidal cells, and 20 fast spiking cells. The minimum plateaus of these cell types are clearly visible on the "sorted intervals" graphs. What the cpu is doing in detail during longer interval times for a call, e.g. memory access cache misses, is not clear to me. It might be better to focus on these things with perf but this seems to give a basic measurement for the majority of intervals and we can compare these to the 8.2 measurements. Note that scatter/gather in the graphs is only about 2% of fadvance. But that is misleading, the call ratio is 1136180/90134 = 12.6, which means total time is more like 25%. It is not clear to me why this is so different from the cumulative time ratio

 >>> 1.15987 /7.35675
0.1576

One wrinkle in regard to model 156120 is that randomness is different for 8.2 and 9.0, which resulted in about 15% more fadvance calls for 9.0. This was fixed by converting the model to use only Random123 streams.

Here is the run result of this model for 8.2.6

ZTIME 0 1.52461 90134
ZTIME 1 0.206413 1136180

[image: interval plots for 8.2.6]

@nrnhines
Member Author

nrnhines commented Dec 19, 2024

With the local variable time step turned off (i.e. using the global variable time step), the results for the 156120 model are
8.2

ZTIME 0 15.2608 11024
ZTIME 1 3.13947 89875

[image: 8.2 interval plots, global time step]

9.0

ZTIME 0 15.539 11024
ZTIME 1 8.15979 89875

[image: 9.0 interval plots, global time step]

Clearly this model will be 5 seconds faster with 9.0 than with 8.2 if we use cached DataHandles for the scatter/gather phase of simulation.

edited: Notice that my local step explanation for the first ~25 near-zero intervals in the scatter/gather plots makes no sense for this global variable step scatter/gather plot. At the moment it is not clear to me why those intervals are small.

edited: When Cvode is initialized, f(y, ydot) is called with ydot = NULL. In those cases, gather_ydot does nothing.
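i.e. something of this shape (sketch only, not the actual member function):

// Sketch of the early return being described: during Cvode initialization
// f(y, ydot) is called with ydot == NULL, so the gather is skipped and the
// measured interval is essentially zero.
void gather_ydot(double* ydot) {
    if (!ydot) {
        return;  // nothing to gather into during initialization
    }
    // ... copy the dstate values into ydot ...
}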

@nrnhines
Member Author

As expected, the 156120 model with global variable time step has better-than-8.2 performance with the hines/cvscatter-ptr branch (which replaces the scatter/gather data_handles with double*):

ZTIME 0 10.6198 11024
ZTIME 1 1.17434 89875

[image: cvscatter-ptr interval plots, global time step]
I guess the scatter/gather is faster than 8.2 (1 s instead of 3 s) for the global step because the scattered variables are more adjacent in 9.0.

On the other hand, as expected, with the local variable time step, scatter/gather improves to better than 8.2 but the overall fadvance time is not as good. Apparently, visiting a random 1/100 region (one cell) of array memory per time step is not good SoA utilization.

ZTIME 0 6.08805 90134
ZTIME 1 0.195485 1136180

[image: cvscatter-ptr interval plots, local time step]

@nrnhines
Member Author

206267/GrC_FHF_ModelDB cvode.use_daspk(1) results with the hines/cvscatter-ptr branch are

ZTIME 0 16.1559 2597
ZTIME 1 0.826078 18297

8.2 results are

ZTIME 0 20.0077 2597
ZTIME 1 1.97465 18297

Current master results are

ZTIME 0 29.8548 2597
ZTIME 1 6.97939 18297

So the cvscatter-ptr branch resolves the performance issue.

Note: ZTIME results are obtained by merging https://github.com/neuronsimulator/nrn/tree/hines/ztime-patch into the relevant branch under consideration.

patch to measure time in fadvance and cvode scatter/gather
Prints total times and call counts on exit with h.quit().
Writes ztime.tmp with the nanoseconds of the last 256 calls.
Use python -i path-to-pltperf.py path-to-ztime.tmp for graphs.

@nrnhines
Member Author

I've noticed a large variation of runtimes (cumulative fadvance) when the same program is run multiple times, e.g. for 267666/publishable_code_4 cvode.use_local_dt(1) with the hines/cvscatter-ptr branch, 10 successive runs have fadvance times 22.8639 18.9383 15.3316 19.6556 19.0598 17.6296 16.7625 19.0117 19.177 16.588. So it becomes difficult to compare different simulation times. It turns out that setting Power Mode from Power Saver to Performance significantly reduces the variation (and improves performance):
14.7517 14.7304 14.8947 14.8335 14.5421 14.9943 14.5099 14.9665 14.6019 14.8015.

For this model, the local vs global time step results for 8.2 and cvscatter-ptr become:

8.2
local step
ZTIME 0 6.13932 212335
ZTIME 1 0.291111 1584823
global step
ZTIME 0 6.09079 212335
ZTIME 1 0.300304 1591036

cvscatter-ptr
local step
ZTIME 0 14.615 212087
ZTIME 1 0.335679 1584683
global step
ZTIME 0 8.08095 212087
ZTIME 1 0.266084 1591340

For the 267666 model, the call counts are slightly different between 8.2 and cvscatter-ptr even after modifying the model to use only Random123. It is not clear at the moment what the origin of this discrepancy is.
Notice that, for a given simulator, the local step and global step call counts for fadvance are the same. This is expected since this model has only a single cell (with 16 compartments and 162 point processes), so the similarity of fadvance times for 8.2 is also expected. What is surprising is the large fadvance time difference between local and global time step for cvscatter-ptr (not explainable by the similar times for scatter/gather). I do notice that this model uses a large number (82) of Vector.record statements, so that seems worth examining.

@nrnhines
Member Author

Just for future accessibility, the chatgpt suggestion for setting Power Mode to Performance with bash is

# Function to set CPU governor
set_governor() {
    local governor=$1
    for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
        echo $governor | sudo tee $cpu/cpufreq/scaling_governor > /dev/null
    done
}

# Save the current governor
original_governor=$(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor)

set_governor performance

# do simulation runs

set_governor $original_governor

An alternative, if it is installed on your machine, is the cpupower utility (see cpupower help).

@nrnhines
Member Author

What is surprising is the large fadvance time difference between local and global time step for cvscatter-ptr (not explainable by the similar times for scatter/gather). I do notice that this model uses a large number (82) of Vector.record statements so that seems worth examining.

Vector.record does not seem to explain very much of the timing differences. I added a timing measurement for Cvode::record_continuous() and see (the numbers are seconds and call count):

cvscatter-ptr  localstep
fadvance 14.322 212087
scatter/gather 0.326384 1584683
vec.record 1.75106 210367

cvscatter-ptr  globalstep
fadvance 7.88049 212087 
scatter/gather 0.250792 1591340
vec.record 1.80664 210367

8.2 localstep
fadvance 6.07042 212335 
scatter/gather 0.288091 1584823
vec.record 1.44338 210734

@nrnhines
Member Author

Attempting to understand why the 9.0 local variable time step method (with one cell) is about half as fast as the 9.0 global variable time step method.
I obtained the data for model 267666 by adding Caliper instrumentation in various cvode-related functions and toggling cvode.use_local_dt(1) in mosinit.hoc.

Collect the vtune data for the Local variable time step with the commands

 vtune -collect hotspots -knob sampling-mode=hw -knob sampling-interval=1 -result-dir vtune-local-result nrniv mosinit.hoc
vtune-gui vtune-local-result &

If one clicks on an item in the Summary's Top Hotspots, then all the caliper items are shown in the Bottom-up page. Then one can select all the Task Type / Function / Call Stack rows, select "copy rows to clipboard", and paste here.

Local variable time step

Task Type / Function / Call Stack	CPU Time	Instructions Retired	Task Time	Task Count
fadvance	17.584s	194,618,000,000	18.087s	212,121
cv:local_microstep	17.002s	190,736,800,000	17.483s	219,497
cv:handlestep	16.822s	189,031,800,000	17.287s	210,401
cv:advance_tn	14.851s	177,837,700,000	15.277s	207,743
cv:fun_thread	9.232s	111,919,300,000	9.385s	505,867
cv:solvex_thread	4.664s	54,739,800,000	4.920s	284,828
cv:solvemem	2.110s	26,337,600,000	2.203s	284,828
cv:record	1.833s	9,789,800,000	1.875s	210,401
[Unknown]	0.383s	2,963,600,000		
cv:scatter_y	0.315s	2,771,400,000	0.312s	505,867
cv:gather_ydot	0.161s	1,736,000,000	0.189s	577,633
cv:scatter_ydot	0.111s	985,800,000	0.116s	284,828
net-receive-AMPA_DynSyn	0.102s	877,300,000	0.103s	5,576
cv:init	0.063s	582,800,000	0.060s	2,661
cv:interpolate	0.055s	691,300,000	0.059s	5,616
finitialize	0.002s	12,400,000	0.003s	2
cv:gather_y	0.001s	6,200,000	0.001s	2,661
spike-exchange	0s		0.000s	2

Similarly,
Global variable time step

Task Type / Function / Call Stack	CPU Time	Instructions Retired	Task Time	Task Count
fadvance	9.351s	87,224,700,000	9.604s	212,121
cv:global_microstep	9.265s	86,548,900,000	9.521s	216,499
cv:handlestep	8.578s	81,846,200,000	8.812s	212,121
cv:advance_tn	6.207s	67,917,900,000	6.354s	207,743
cv:fun_thread	3.501s	39,459,900,000	3.594s	509,196
cv:record	2.244s	12,719,300,000	2.335s	210,401
cv:solvex_thread	1.773s	17,797,100,000	1.821s	284,828
cv:solvemem	1.003s	10,713,600,000	1.023s	284,828
[Unknown]	0.295s	2,464,500,000		
cv:scatter_y	0.220s	1,863,100,000	0.230s	509,196
cv:gather_ydot	0.173s	1,884,800,000	0.187s	577,633
cv:scatter_ydot	0.114s	812,200,000	0.115s	284,828
cv:interpolate	0.063s	657,200,000	0.063s	5,988
net-receive-AMPA_DynSyn	0.053s	223,200,000	0.049s	5,616
cv:init	0.034s	248,000,000	0.029s	2,660
finitialize	0.003s	9,300,000	0.002s	2
cv:gather_y	0.001s	3,100,000	0.001s	2,660
spike-exchange	0s		0.000s	2

There are a few things that make sense. Because there is only one cell, the Task Counts are similar for local and global; cv:record and the cv:scatter/gathers are also similar. But the main cv:advance_tn and its main elements, fun_thread and solvex_thread, have more than a factor of two difference in times and instructions retired (even though they are called a similar number of times). That is a puzzle that needs to be resolved. I will repeat this experiment for 8.2 but do not expect much difference between local and global time step with a single cell.

@nrnhines
Member Author

Here are the 8.2 results for 267666. No puzzle.

Local variable time step

Task Type / Function / Call Stack	CPU Time	Instructions Retired	Task Time	Task Count
fadvance	6.684s	59,681,200,000	6.888s	212,690
nc:local_microstep	6.602s	58,946,500,000	6.800s	219,952
cv:handle_step	6.497s	57,997,900,000	6.686s	211,089
cv:advance_tn	4.743s	49,565,900,000	4.908s	208,469
cv:fun_thread	2.444s	26,963,800,000	2.565s	507,007
cv:record	1.643s	7,421,400,000	1.666s	211,089
cv:solvex_thread	1.393s	11,708,700,000	1.416s	285,432
cv:solvemem	0.782s	7,347,000,000	0.763s	285,432
[Unknown]	0.316s	2,424,200,000		
cv:scatter_y	0.256s	1,643,000,000	0.256s	507,007
cv:gather_ydot	0.184s	1,695,700,000	0.222s	578,727
cv:scatter_ydot	0.126s	843,200,000	0.135s	285,432
net-receive-AMPA_DynSyn	0.030s	263,500,000	0.033s	5,540
cv:init	0.016s	155,000,000	0.016s	2,623
cv:interpolate	0.012s	164,300,000	0.016s	5,540
finitialize	0.001s	6,200,000	0.002s	2
cv:gather_y	0s	3,100,000	0.001s	2,623
spike-exchange	0s		0.000s	2
cur-CaIntraCellDyn	0s		0.000s	8
cur-pas	0s		0.000s	23
setup-tree-matrix	0s		0.000s	1
gap-v-transfer	0s		0.000s	2

Global variable time step

Task Type / Function / Call Stack	CPU Time	Instructions Retired	Task Time	Task Count
fadvance	6.542s	59,929,200,000	6.743s	212,690
nc:global_microstep	6.454s	59,151,100,000	6.655s	216,911
cv:handle_step	6.320s	58,218,000,000	6.531s	212,690
cv:advance_tn	4.608s	49,671,300,000	4.768s	208,469
cv:fun_thread	2.475s	27,255,200,000	2.500s	510,115
cv:record	1.597s	7,396,600,000	1.638s	211,089
cv:solvex_thread	1.266s	11,807,900,000	1.360s	285,432
cv:solvemem	0.684s	7,511,300,000	0.743s	285,432
[Unknown]	0.335s	2,371,500,000		
cv:scatter_y	0.238s	1,674,000,000	0.249s	510,115
cv:gather_ydot	0.189s	1,751,500,000	0.199s	578,727
cv:scatter_ydot	0.127s	849,400,000	0.132s	285,432
cv:interpolate	0.036s	331,700,000	0.033s	5,729
net-receive-AMPA_DynSyn	0.024s	65,100,000	0.016s	5,540
cv:init	0.014s	173,600,000	0.016s	2,622
finitialize	0.003s	9,300,000	0.002s	2
cv:gather_y	0.001s	6,200,000	0.001s	2,622
spike-exchange	0s		0.000s	2
cur-CaIntraCellDyn	0s		0.000s	8
cur-pas	0s		0.000s	23
setup-tree-matrix	0s		0.000s	1
gap-v-transfer	0s		0.000s	2

@nrnhines
Member Author

The following has to be a huge clue, but I don't fully grasp its implications. The first of each pair of lines is the Bottom-up Task Type / Function / Call Stack entry for the first dozen or so descendants of cv:advance_tn, in CPU Time order, for the local variable time step. The second of each pair is the corresponding global variable time step entry with its CPU Time and Instructions Retired.

Task Type / Function / Call Stack	CPU Time	Instructions Retired	Task Time	Task Count
cv:advance_tn	14.851s	177,837,700,000	15.277s	207,743
cv:advance_tn   6.207s  67,917,900,000  6.354s  207,743

  neuron::Model::mechanism_data_impl	1.195s	19,914,400,000		
  neuron::Model::mechanism_data_impl    0.088s  1,181,100,000

  Memb_list::type	0.778s	10,512,100,000		
  Memb_list::type       0.054s  666,500,000

  neuron::container::Mechanism::storage::type	0.606s	5,967,500,000		
  neuron::container::Mechanism::storage::type   0.042s  390,600,000

  __libm_exp_z0	0.521s	6,981,200,000		
  __libm_exp_z0 0.490s  6,696,000,000

  neuron::mechanism::get_field_count<double>	0.393s	4,994,100,000		
  neuron::mechanism::get_field_count<double>    0.023s  294,500,000

  neuron::mechanism::get_data_ptrs<double>	0.313s	3,540,200,000		
  neuron::mechanism::get_data_ptrs<double>      0.033s  238,700,000

  neuron::mechanism::_get::_pdata_ptr_cache_data	0.281s	4,991,000,000		
  neuron::mechanism::_get::_pdata_ptr_cache_data        0.027s  316,200,000

  neuron::mechanism::get_array_dims<double>	0.276s	3,906,000,000		
  neuron::mechanism::get_array_dims<double>     0.024s  263,500,000

  hoc_Exp	0.270s	3,580,500,000		
  hoc_Exp       0.282s  3,512,300,000

  std::vector<std::unique_ptr<neuron::container::Mechanism::storage, 
std::default_delete<neuron::container::Mechanism::storage>>, 
std::allocator<std::unique_ptr<neuron::container::Mechanism::storage, 
std::default_delete<neuron::container::Mechanism::storage>>>>::size	0.261s	4,653,100,000	
  std::vector<std::unique_ptr<neuron::container::Mechanism::storage, 
std::default_delete<neuron::container::Mechanism::storage>>,
std::allocator<std::unique_ptr<neuron::container::Mechanism::storage,
std::default_delete<neuron::container::Mechanism::storage>>>>::size    0.015s 282,100,000	

  func@0x19b6e0	0.248s	1,131,500,000		
  func@0x20370  0.088s  139,500,000  # don't know. Just the first one in the list

  func@0x190910	0.240s	1,202,800,000		
  func@0x1991f0 0.048s  189,100,000 # don't know. Just the second one in the list

  __libm_pow_l9	0.233s	2,799,300,000		
  __libm_pow_l9 0.306s  2,954,300,000

  func@0xa520	0.223s	1,125,300,000		
  func@0x149d0  0.047s  120,900,000 # don't know. Just the third one in the list

  [vmlinux]	0.165s	719,200,000		
  [vmlinux]     0.104s  523,900,000

  Cvode::do_ode	0.152s	1,004,400,000		
  Cvode::do_ode 0.022s  167,400,000

  Cvode::scatter_y    0.133s	762,600,000
  Cvode::scatter_y    0.108s  784,300,000

@nrnhines
Member Author

It wasn't clear to me whether relative (between local step and global step) Instructions Retired could be considered a proxy for call count. Explicitly counting the calls into mechanism_data_impl gives a ratio of

call count ratio = local / global = 1171351415 / 72606493 = 16.13

in comparison to above vtune data

Instructions Retired ratio = local/global = 19,914,400,000 / 1,181,100,000 = 16.86

On the other hand, for functions higher up in the call chain the correspondence is less clear, e.g.

  Cvode::do_ode	0.152s	1,004,400,000		
  Cvode::do_ode 0.022s  167,400,000

Hmm. Pursuing the call counts inside Cvode::do_ode:

void Cvode::do_ode(neuron::model_sorted_token const& sorted_token, NrnThread& nt) {
    ZTCNT(3);  // ztime-patch counter 3: number of calls to do_ode
    // all the membrane mechanism ode's
    CvodeThreadData& z = CTD(nt.id);
    for (auto* cml = z.cv_memb_list_; cml; cml = cml->next) {  // probably can start at 6 or hh
        if (auto* const ode_spec = memb_func[cml->index].ode_spec; ode_spec) {
            for (auto& ml: cml->ml) {
                ZTCNT(4);  // counter 4: iterations over the Memb_list entries in cml->ml
                ode_spec(sorted_token, &nt, &ml, cml->index);
                if (errno && nrn_errno_check(cml->index)) {
                    hoc_warning("errno set during ode evaluation", nullptr);
                }
            }
        }
    }
    long_difus_solve(sorted_token, 1, nt);
}

We see:

global
ZTIME 3 0 509196
ZTIME 4 0 6110352

local
ZTIME 3 0 505867
ZTIME 4 0 105220140

The inner loop (counter 4) executes about 17 times more often for the local step even though do_ode itself (counter 3) is called a nearly identical number of times, so for the local step the cml->ml lists must contain many more entries. There seems to be some kind of duplication in the ranges of the two for loops.

@nrnhines
Member Author

The origin of this performance issue is indicated by the comments in netcvode.cpp, in bool NetCvode::init_global():

    // We copy Memb_list* into cml->ml below. At the moment this CVode code
    // generates its own complicated set of Memb_list* that operate in
    // list-of-handles mode instead of referring to contiguous sets of values.
    // This is a shame, as it forces that list-of-handles mode to exist.
    // Possible alternatives could include:
    // - making the sorting algorithm more sophisticated so that the values that
    //   are going to be processed together are contiguous -- this might be a
    //   bit intricate, but it shouldn't be *too* had to assert that we get the
    //   right answer -- and it's the only way of making sure the actual compute
    //   kernels are cache efficient / vectorisable.
    // - changing the type used by this code to not be Memb_list but rather some
    //   Memb_list_with_list_of_handles type and adding extra glue to the code
    //   generation so that we can call into translated MOD file code using that
    //   type
    // - Using a list of Memb_list with size 1 instead of a single Memb_list
    //   that holds a list of handles?

and by the comment at line 1658:

   // non-contiguous mode, so we're going to create a lot of 1-element Memb_list
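In other words (invented stand-in types below, not NEURON's): every ode_spec call pays a fixed amount of bookkeeping (the mechanism_data_impl, get_field_count, get_data_ptrs, ... entries in the vtune table above) before touching any instance data, so N one-element Memb_list entries pay that cost N times, whereas one contiguous N-element Memb_list pays it once.

// Invented stand-ins to illustrate the cost structure, not NEURON types.
#include <cstddef>

struct MembListSketch { std::size_t nodecount; };

// Fixed per-call bookkeeping done before any instance data is touched
// (storage lookups, field counts, pointer/array-dim queries, ...).
void fixed_per_call_overhead() {}

// The actual per-instance derivative work.
void per_instance_work(std::size_t) {}

void ode_spec_sketch(const MembListSketch& ml) {
    fixed_per_call_overhead();                 // paid once per call
    for (std::size_t i = 0; i < ml.nodecount; ++i) {
        per_instance_work(i);                  // paid once per instance
    }
}

// Contiguous mode: one Memb_list with n instances -> overhead paid once.
void contiguous(std::size_t n) { ode_spec_sketch(MembListSketch{n}); }

// List-of-handles mode: n Memb_lists of size 1 -> overhead paid n times,
// consistent with the local step's much larger ZTIME 4 and
// mechanism_data_impl counts above.
void one_element_lists(std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        ode_spec_sketch(MembListSketch{1});
    }
}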

@nrnhines
Member Author

nrnhines commented Jan 9, 2025

The https://github.com/neuronsimulator/nrn/tree/hines/cellcontig-scatter-treemat branch exhibits performance much closer to 8.2 but is still not quite there. E.g. 267666 lvardt using the intel compiler:

version   lvardt 
8.2        5.3
master     16.6
branch     7.0

The main features of this branch are changing the default node data storage sort to contiguous cells and setting/solving the tree matrix for each cell directly using the SoA storage (instead of via Node** access).
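To picture the tree matrix part, here is a sketch with invented names, assuming an unbranched cable whose parent of node i is node i-1 (the real Hines solver also uses a parent index array to handle branching): the triangularization and back-substitution sweeps become streaming passes over the cell-contiguous SoA columns instead of pointer chases through Node objects.

// Invented types/names; NEURON-style fields a (coupling in the parent row),
// b (coupling to the parent in this row), d (diagonal), rhs (right-hand side).
#include <cstddef>

struct NodeSketch { double a, b, d, rhs; };

// Node** style: every coefficient access is an extra indirection and the
// Node objects need not be adjacent in memory.
void triang_bksub_nodes(NodeSketch* const* node, std::size_t n) {
    if (n == 0) return;
    for (std::size_t i = n - 1; i > 0; --i) {      // triangularize, leaf to root
        double p = node[i]->a / node[i]->d;
        node[i - 1]->d -= p * node[i]->b;
        node[i - 1]->rhs -= p * node[i]->rhs;
    }
    node[0]->rhs /= node[0]->d;
    for (std::size_t i = 1; i < n; ++i) {          // back-substitute, root to leaf
        node[i]->rhs = (node[i]->rhs - node[i]->b * node[i - 1]->rhs) / node[i]->d;
    }
}

// Cell-contiguous SoA style: the cell's rows are a contiguous slice of each
// column, so the same sweeps stream through adjacent memory.
void triang_bksub_soa(double* a, double* b, double* d, double* rhs, std::size_t n) {
    if (n == 0) return;
    for (std::size_t i = n - 1; i > 0; --i) {
        double p = a[i] / d[i];
        d[i - 1] -= p * b[i];
        rhs[i - 1] -= p * rhs[i];
    }
    rhs[0] /= d[0];
    for (std::size_t i = 1; i < n; ++i) {
        rhs[i] = (rhs[i] - b[i] * rhs[i - 1]) / d[i];
    }
}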

@nrnhines
Member Author

The branch https://github.com/neuronsimulator/nrn/tree/hines/cellcontig-treemat2 perhaps suffices (for x86_64) to get variable step method performance close enough to 8.2. This branch uses the direct backing store instead of Node* for cvode's special handling of zero-area nodes (nodes without capacitance). Performance results for ringtest, for the sequence of branches that starts with master and ends with nocap, are:

nrniv -python ringtest.py -npt 1 -compart 1 3 -nring 32 -rparm -nt 1

x86_64 intel icx
version  fixed  global  local
8.2      1.86   30.5    1.35
master   1.53   56.0    6.67
contig   1.59   60.8    2.76
scatter  1.71   48.4    2.32
treemat  1.61   35.9    2.01
nocap    1.6    25.1    1.78

x86_64 gcc 13.3.0
version  fixed  global  local
8.2      2.92   51.9    2.47
master   2.38   63.4    6.19
contig   2.09   63.6    3.10
scatter  1.98   49.8    2.54
treemat  2.01   38.7    2.25
nocap    2.01   28.4    2.01

Note that none of these branches affect the fixed step performance and the variation can perhaps be attributed to test noise.

An unresolved issue is the Apple M1 results, which are:

Apple M1
version     fixed  global  local
8.2         1.87   23.67   1.32
master      4.1    60.1    5.3
contig      5.0    70.5    4.1
scatter     4.9    64.6    3.8
treemat     4.1    47.5    2.9
nocap       4.9    58.3    3.47
coreneuron  2.0

Performance data for the three ModelDB models that initiated work on this issue are:

x86_64 intel icx
          lvardt lvardt daspk       
version   267666 156120 206267       
8.2       5.6    1.07   11.2
master    16.2   4.8    13.0 
nocap     7.2    1.18   10.3

@nrnhines
Member Author

nrnhines commented Jan 21, 2025

I've redone the graphs at the top of this page using 8.2 (078a34a) and 9.0a-506-g96b78c617 hines/cellcontig-treemat2, with intel icx, python3.11, and Power Mode set to Performance.

cmake .. -G Ninja -DCMAKE_INSTALL_PREFIX=install -DNRN_ENABLE_PROFILING=ON -DNRN_PROFILER=caliper -DCMAKE_PREFIX_PATH=$HOME/soft/Caliper/build/install

Data gathered (for 8.2) with

export CALI_CONFIG=runtime-report
runmodels --clean --workdir 8.2

Note that --virtual was left off to avoid hanging after about 30% of the models were run.
It should also be mentioned that the total time for runmodels was about 4 minutes for 8.2 and 17 minutes for 9.0. I speculate that most of that difference was due to nrnivmodl. The logs can be consulted about this, since each model run prints the elapsed time used by nrnivmodl and nrniv.

[images: updated log and linear plots of 9.0 vs 8.2 times]

For these results, the models with 8.2 runtime > 8 s for which 8.2 is significantly faster than 9.0 are

model id     8.2 (s)         9.0 (s)
['206267', 50.729526, 54.365123]     GrC_FHF_ModelDB   uses IDA
['146030', 31.382019, 33.892184]     oconnoretal2012  fixed step
['266578', 31.144368, 44.136395]     ModelDBRepository-266578-b201957 fixed step
['64259', 25.119902, 31.893658]       NCnote  fixed step
['244692', 20.69953, 25.913363]       Appukuttan_et_al_2016_Fig3  fixed step
['267691', 8.8851, 19.598119]           KumaraveluEtAl2022    fixed step

@nrnhines
Member Author

Just for completeness, here are the model run times (from nrniv launch to exit, so includes setup time) corresponding to the above graph.

[image: total model run times, nrniv launch to exit]
