NEST as a SPEC CPU benchmark #3217
Hello Mahesh! We are very excited that you have proposed NEST for inclusion in the SPEC CPUv8 benchmark suite! We would be happy to work with you to make this happen. I will answer your more specific questions below.

One important point about NEST (and other neuronal network simulators) is that you can run a wide range of neuronal network models on the NEST simulator, constructing the networks either through SLI or through Python scripts. And if networks are defined in the PyNN specification language, they can be executed on a range of neuronal network simulators, including some neuromorphic systems. Therefore, there isn't "the nest neural network model". The advantage of this is that one can configure networks that are suitable for benchmarking. In our own work, we have mainly used the

Concerning the specific Brunel benchmark you used: increasing the number of threads will distribute the same workload across more threads, i.e., strong scaling.

I am a bit confused by the number of output files you report. When running with eight threads, I would expect eight

I noticed that you used a

Best,
Hans Ekkehard
Thanks for responding, Hans! I had a feeling you would ask me to rebase, so I attempted that three-way merge right after I posted above. The SPEC CPU harness builds applications in a totally different manner, so the process of adding a benchmark requires taking humpty-dumpty apart and putting him back together. Something may have gotten lost in that process, because after my merge, I get this runtime error immediately:
Here is the output from

Can you provide some hints as to what I should look at? I am building all the modules as well as
On the topic of workloads, thank you for the feedback on the benchmark scripts.
Indeed, you are correct! I counted wrong, sorry about that. So the question is, how can I use those files to prove the same work occurred between the two runs, and check that they both came to the "correct answer"? SPEC CPU benchmarks need to scale from 1 thread to N threads, and verify the results are the same using any N. This is the challenge in front of us. For the single-threaded benchmark, I am using these scripts which seem to work well. Are they ok?
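One way to check that two runs did the same work regardless of how many files each produced is to pool all spike events across the per-thread files and compare the merged, sorted streams. A minimal sketch, assuming each `.dat` line holds a `sender_id time` pair (the file-name patterns are hypothetical):

```python
import glob

def merged_spikes(pattern):
    """Pool spike events from all per-thread .dat files matching pattern.

    Assumes each line holds "<sender_id> <spike_time>". The merged,
    sorted event list should not depend on how events were split
    across threads/files, so it can be diffed between runs.
    """
    events = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as fh:
            for line in fh:
                parts = line.split()
                if len(parts) >= 2:
                    events.append((int(parts[0]), float(parts[1])))
    return sorted(events)

# Hypothetical usage: compare an 8-thread run against a 16-thread run.
# run_a = merged_spikes("run8/*.dat")
# run_b = merged_spikes("run16/*.dat")
# same_work = (run_a == run_b)
```

Note that with thread-specific random number streams the merged lists will not be bit-identical, so an exact comparison like this only works if the randomness is held fixed.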
Ahoy! Sometimes crafting a detailed message to the author is all that is required to solve one's problem. While making the debug.log and looking at the first line of the output, I realized that I had forgotten to update the
Regarding
@heshpdx NEST creates a separate spike recorder instance for each thread to allow non-blocking recording, and each of these instances opens a separate file. Thus, the

I presume that with SPEC CPU you want to test CPU performance rather than I/O system performance. If this is indeed the case, I would suggest dropping the spike time recording to file. This would solve the problem with limited numbers of files.

If SPEC rules allow you to do one run with some form of recording of spikes and another run without it, one could use the following approach: Perform a run with spike recorders in which the spikes are recorded to memory. After the simulation time is up, extract spike data from the spike recorder in the SLI script and write it to file. This only requires one file per process instead of one per thread. Also, read out the spike counter in NEST. Then turn off spike recording completely and re-run the simulation. Read out the spike counter (it is always active) and you should get exactly the same number of spikes as when recording. I can send you code to do this.

When simulating the same network with different numbers of threads, you will obtain results that differ in detail, since the random number sequences in NEST are thread-specific. We have developed several measures to verify that simulations do produce statistically consistent results; see the paper by Gutzen et al. below. If you positively need identical results independent of the number of threads used, we can develop a work-around for that.

For strong-scaling experiments, I would encourage you to use the microcircuit model, as it is currently the most widely used network model for benchmarking; see the paper by Kurth et al. below. That paper also gives a rather recent description of the state of the art, although NEST 3.6 and later now outperform the NEST 2.14 that was used in that paper.

Concerning a single-thread benchmark, what are the constraints on running a single benchmark?
On my MacBook with Core i5, hpc_benchmark.sli takes about two minutes in total for a single run. If that is acceptable, I would suggest using hpc_benchmark.
And here the promised references:
You are correct that we want to be CPU bound and not I/O bound. But even with the current code, outputting a file per thread, we are quite CPU bound. Your spike-recorder-in-memory idea is very good. If you can coalesce that using SLI directives at the end of the simulation, that would solve our problem. The one limitation we have is that the run should fit within 64 GB of virtual memory (for the parallel runs). I imagine that is big enough. Please do work on that when you get a chance. I recognize that the NEST conference is coming up and you will be busy with that!

The thread count limitation was only seen in microcircuit.sli. I'll see if I can dig up why. Most OSes have a default open file descriptor limit of 4096, so I think the issue may be more esoteric.

I'll take a look at hpc_benchmark for single-threaded runs. I wanted to have a wide variety of code coverage and CPU behavior in three or four workloads. The time limit for single-threaded is 3-5 minutes on a "modern CPU" and it needs to fit within 1.8 GB of memory.
Regarding exact answers - we do allow some tolerance based on floating point rounding. If there is a lot of randomness, we can try to reduce that via more deterministic randomness. I've seen that in many benchmark candidates, less (or zero) randomness still provides valid answers and doesn't detract from creating a benchmark representative of the application behavior in the field. With my latest rebase I still get issues with running Potjans_2014/microcircuit.sli with more than 63 threads (in the SPEC CPU harness):
I have enough file descriptors:
I played around with the mainline nest build, and realized the issue is not with threads, but with
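For reference, the per-process descriptor limit can be inspected from inside the run itself with the standard library (Unix only), which helps confirm whether one-file-per-thread recording is actually hitting the limit:

```python
import resource

# Soft/hard limits on open file descriptors for this process.
# Each per-thread spike recorder that writes to file consumes one
# descriptor, so the soft limit bounds the usable thread count
# when recording to file.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft} hard={hard}")
```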
I had another look at

Concerning setting the number of threads, I am surprised that
M is defined via mpirun. Inside a NEST script you can either set T through the parameter

The set of files in

One more question: We are mostly using Python scripts in our benchmarking now. Would that be feasible for your benchmarks as well, or does that introduce too many confounding factors so that you would want to avoid Python?
Oh yes, I did change
Thanks for the explanations. I will create a version of

Two more questions concerning the SPEC CPU setup: Can you use thread-aware allocators such as jemalloc, and can you pin threads to cores? We have seen significant benefits from both in our benchmarks. In one case, we even found that we needed to change BIOS settings to ensure consistent pinning.
Thank you! Yes, these are the two main issues. Outputting all the data into a fixed number of files will solve the problem. Thread affinity is allowed, but that is done at a higher level in the harness, so the user must choose the same pinning scheme for all benchmarks. Custom memory allocators are also allowed. Some benchmarks show gains when linking with jemalloc or tcmalloc, others don't.
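For completeness, the kind of allocator/pinning setup discussed here typically looks like the following; this is an illustrative config fragment, not harness code, and the library path and core list are assumptions about a particular Linux system:

```shell
# Preload a thread-aware allocator (path is system-specific).
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2

# Pin OpenMP threads to cores; exact variables depend on the OpenMP runtime.
export OMP_NUM_THREADS=8
export OMP_PROC_BIND=close
export OMP_PLACES=cores

# Or pin the whole process externally:
taskset -c 0-7 nest microcircuit_spec.sli
```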
Hi @heshpdx! I have now created a modified version of

It is monolithic, i.e., you need only this one SLI file. Parameters can be passed on the command line like this (all optional):

nest --userargs=threads=8:scale=0.4:seed=12345 microcircuit_spec.sli

It does not write any spikes to files, but writes a short report at the end to
I hope this is suitable for SPEC use. With scale 0.4, it should be reasonable for a single thread, with scale 1.0 for larger numbers of threads. As I mentioned earlier, NEST benefits from jemalloc/tcmalloc and similar, and from thread pinning.
Thanks! I will try this out. For the verification to be scalable, we will change the output to look like the following (which might require commenting out some of the source):
Basically, cut out any run-specific information such as thread count and time. Then we can diff this output while applying a tolerance to the spike numbers (if we need to). Would it be easy to do this for hpc_benchmark too? I will take a look tomorrow.
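The tolerance-based diff described here could be sketched as follows, assuming the report consists of `Label: value` lines as in the sample output; numeric fields are compared within a relative tolerance, everything else exactly:

```python
def reports_match(text_a, text_b, rel_tol=0.03):
    """Compare two simulation reports line by line.

    Numeric values (e.g. spike counts) must agree within rel_tol;
    non-numeric fields must match exactly. Assumes "Label: value" lines.
    """
    lines_a = [l for l in text_a.splitlines() if l.strip()]
    lines_b = [l for l in text_b.splitlines() if l.strip()]
    if len(lines_a) != len(lines_b):
        return False
    for a, b in zip(lines_a, lines_b):
        key_a, _, val_a = a.partition(":")
        key_b, _, val_b = b.partition(":")
        if key_a.strip() != key_b.strip():
            return False
        try:
            x, y = float(val_a), float(val_b)
            if abs(x - y) > rel_tol * max(abs(x), abs(y), 1.0):
                return False
        except ValueError:
            if val_a.strip() != val_b.strip():
                return False
    return True
```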
To change the output, you only need to comment out or remove a few lines at the end of the script. The "Population spikes" line is only there as a double check; "Total spikes" should suffice to check that the simulation is running correctly. "Network size" and "Num connections" depend on scale, but not on the number of threads or on the random seed.

For actual benchmark runs, you may want to pass "record=false" as a userarg. Then no spike recorders are created and you will only get the "Total spikes", not the "Population spikes". I just pushed another update to the script so that, if recording, it always records from all neurons, removing a potential source of confusion.

It should be rather straightforward to apply the same user-argument handling (at the beginning of the script) and reporting (at the end) to hpc_benchmark. The main difference is that hpc_benchmark has only two neuron populations, E(xcitatory) and I(nhibitory). Let me know if you run into problems!
Thank you, this looks good. If I run it multiple times with the same cmdline, I get the same results. When I run it with varying number of threads, the Total spikes and Population spikes aren't exact but they are close. What kind of delta would you expect? Here is some data using
It appears that the number of local nodes grows as I add more threads. My machine maxes out at 160 hardware cores, and that configuration shows the largest number of local nodes, although the "network size" is invariant across all these runs at 23163. Is this ok? Are any of the output values "out of bounds" in your opinion? It looks like we need a tolerance of about 3% for the spike counts and sum.

Is there a way to output just the "Simulation Report" to a separate file instead of stdout? That way I can also avoid lines like

Also, I diffed this script against the microcircuit.sli in your mainline. There are a lot of changes! I'm wondering, could you apply the same kind of changes to hpc_benchmark? Humbly speaking, it would be a lot faster for you to do it correctly than for me to learn it.

Finally, I have been using BrodyHopfield.sli and brunel_ps.sli as the "test" and "train" workloads, which are meant to be very small and kind of small, respectively. Can you help print the same kind of simulation report at the end of those tests to facilitate verification? I tried copying the stanzas from microcircuit.sli, but I am not sure which variables these simulations know about. (Alternatively, I can scale down microcircuit.sli to also be the test workload, but I would like to have some more code coverage, and not have the "training" workload be exactly the same as the reference benchmark.) Thank you!
@heshpdx I have just pushed a new version of
I pulled your code and started playing around with the scripts. I also see hpc_benchmark and cuba_stdp have spec versions, thank you. I had to make this small change to hpc to avoid a syntax error.
I'll try out some values for scale so that the simulation runtime fits within our time and memory budget! |
I played around with microcircuit, cuba_stdp, and hpc_benchmark. The first two are resilient to various system changes and can verify within a 3% delta. hpc_benchmark gives a drastically different answer between GCC and LLVM. I ran the models with 80 threads and a Scale which requires about 15 GB of memory. About 3 minutes of runtime gives:
Do you have insight into why LLVM/clang is so different? We have gone through the source and removed all unstable algorithms and other sources of differences we know of. For example, we replaced std::sort with std::stable_sort, replaced the uniform_distribution/poisson_distribution calls with the LLVM versions as compiled code, and replaced all calls to std::rand with our own "specrand" generator. Usually this does the trick, but there is something else afoot. I would love to get hpc_benchmark working since you mentioned above it is the best one. But I am glad that we have cuba_stdp and microcircuit working at least. |
Some more datapoints, first one from
Due to the plasticity in the network, simulations of hpc_benchmark can take very different trajectories, especially if simulated for more than 300 ms. This can lead to wildly differing firing rates, which are generally far higher than biologically plausible rates (around 10 sp/s).

To avoid this problem, I have just pushed a version of hpc_benchmark in which the learning rate is set to zero by default, so that synaptic weights remain constant. The simulation still goes through the mechanics of the spike-timing-dependent plasticity mechanism, but the dynamics will now stay much more stable. Simulation times should also come down noticeably, since firing rates remain reasonable. Furthermore, variations for different seeds/thread numbers/compilers should now be only a few percent.

The variation in rates which you observed between GCC and Clang and between different thread numbers is most likely due to different random number sequences being generated (Poisson random variates). To test whether differences you observe between different thread numbers/compilers are reasonable, you should run simulations with five different random seeds and note the variation in spike numbers you get. Changing the number of threads or the compiler should lead to variations in spike numbers comparable to what you see for different seeds.
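The seed-sensitivity check suggested above is easy to automate: run the model with several seeds, collect the total spike counts, and summarize their spread. A sketch (the seed loop is hypothetical; only the spread calculation is concrete):

```python
from statistics import mean, stdev

def relative_spread(counts):
    """Coefficient of variation of total spike counts across seed runs.

    A run with a different thread count or compiler is plausible if
    its count falls within this seed-to-seed variation.
    """
    m = mean(counts)
    return stdev(counts) / m if m else float("inf")

# Hypothetical usage with counts collected from five seeds:
# counts = [total_spikes_for_seed(s) for s in range(1, 6)]
# print(f"seed-to-seed variation: {relative_spread(counts):.1%}")
```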
Great, thank you. I had increased the simulation time because it seemed that the majority of the runtime was in the setup phase when the network is being constructed (especially for larger Scale networks), and I wanted to make sure we are benchmarking the simulation as well. |
The new hpc script works well. I had to increase tolerance from 3% to 4% to empirically allow results from 160-thread runs that looked ok otherwise. I did notice that the rng_seed value printed in the report is always 1. Maybe it is printing a bool instead of the value? |
4% tolerance sounds perfectly fine. I am surprised that you always get 1 for the rng_seed. In my tests, I see the seed value that I pass as
My apologies, the seed issue is my own fault! I had changed the source last year when I ingested the code into the harness, to peg all runs with the same seed. Let me think if I want to open that up again, or keep it fixed for all time.
I see your point of fixing the seed for consistency in benchmarks. But NEST has a fixed default seed already, see
The advantage of a configurable seed is that you can check for the range of variation in simulation results for different random number sequences.
Yes, I agree. I reverted my change so as to regain the flexibility. Things are running well now. I will let you know how it goes!
@heshpdx Could you contact me by email ([email protected])?
Hello friends,
I’m a CPU architect at Ampere Computing where I do performance analysis and workload characterization. I also serve on the SPEC CPU committee, working on benchmarks for the next version of SPEC CPU, CPUv8. We try to find computationally intensive workloads in diverse fields, to help measure performance across a wide variety of behaviors and application domains. Based on the longevity of nest, its large active community in biology, and its use in education, I have proposed that the nest neural network model be included in the next set of marquee benchmarks in SPEC CPU.
As part of the effort, we have ported and integrated the nest mainline code into the SPEC CPU harness so that it can be tested on a wide variety of systems in a controlled environment to produce reproducible results. We have even built it on native Windows using MSVC and the Intel compiler for Windows – we are happy to share the changes if someone is interested in testing and integrating it back into the upstream mainline for the benefit of the community.
The piece we need help with is an understanding of the multithreaded workloads. Right now, we have single-threaded nest command lines which run and produce verifiable output across many compilers (llvm, gcc, icc, aocc, nvhpc, cray), ISAs (aarch64, x86, power) and operating systems (linux, windows, android). We verify the run by checking the .dat files which come out of the simulation runs to make sure that there are no differences in the resulting output. A problem arises when we run with multiple threads, since a different number of files is produced, and I am unfamiliar with how to coalesce them for verification.
First some fundamental questions: Does a nest invocation with 8 threads perform the same amount of work as a run with 16 threads? Or is it that the problem being solved is larger? If it is the same, how can we verify that? Does this answer change based on the .sli script used?
In the example below, if I run
examples/nest/brunel-2000_newconnect_dc.sli
(with a small edit to make it run longer)... I tried with 8 threads and 16. It looks like I am simulating the same number of Neurons and Synapses. The 8-thread version outputs 16 files and the 16-thread version outputs 24 files. The total lines in the files are close. Do they fundamentally contain the same information, just at different sample points?

Overall, the goal is to be able to verify that the same amount of work was completed between these two command lines, and verify that they calculated the same result. This allows a benchmark to run on systems with a varying number of hardware cores, so we can measure CPU performance between them. We are allowed to provide some tolerance, in case there is floating-point rounding error.
For the multithreaded benchmark, I am exercising the scripts below. The goal is to showcase scalable threading performance, as well as cover a variety of behaviors in the nest simulator.
If you have feedback on which are more or less useful as multithreaded benchmarks, please share your thoughts!
Thank you!