NEST as a SPEC CPU benchmark #3217
Hello Mahesh! We are very excited that you have proposed NEST for inclusion in the SPEC CPUv8 benchmark suite! We would be happy to work with you to make this happen. I will answer your more specific questions below.

One important point about NEST (and other neuronal network simulators) is that you can run a wide range of neuronal network models on the NEST simulator, constructing the networks either through SLI or through Python scripts. And if networks are defined in the PyNN specification language, they can be executed on a range of neuronal network simulators, including some neuromorphic systems. Therefore, there isn't "the nest neural network model". The advantage of this is that one can configure networks that are suitable for benchmarking. In our own work, we have mainly used the

Concerning the specific Brunel benchmark you used: increasing the number of threads will distribute the same workload across more threads, i.e., strong scaling.

I am a bit confused by the number of output files you report. When running with eight threads, I would expect eight

I noticed that you used a

Best,
Hans Ekkehard
Thanks for responding, Hans! I had a feeling you would ask me to rebase, so I attempted that three-way merge right after I posted above. The SPEC CPU harness builds applications in a totally different manner, so the process of adding a benchmark requires taking humpty-dumpty apart and putting him back together. Something may have gotten lost in that process, because after my merge, I get this runtime error immediately:
Here is the output from

Can you provide some hints as to what I should look at? I am building all the modules as well as
On the topic of workloads, thank you for the feedback on the benchmark scripts.
Indeed, you are correct! I counted wrong, sorry about that. So the question is, how can I use those files to prove the same work occurred between the two runs, and check that they both came to the "correct answer"? SPEC CPU benchmarks need to scale from 1 thread to N threads, and verify the results are the same using any N. This is the challenge in front of us. For the single-threaded benchmark, I am using these scripts which seem to work well. Are they ok?
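One way to check that two runs did the same work regardless of how many files each produced is to pool all spike events across the per-thread files and compare the merged, sorted streams. A minimal sketch, assuming each `.dat` line holds a `sender_id time` pair (the file-name patterns are hypothetical):

```python
import glob

def merged_spikes(pattern):
    """Pool spike events from all per-thread .dat files matching pattern.

    Assumes each line holds "<sender_id> <spike_time>". The merged,
    sorted event list should not depend on how events were split
    across threads/files, so it can be diffed between runs.
    """
    events = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as fh:
            for line in fh:
                parts = line.split()
                if len(parts) >= 2:
                    events.append((int(parts[0]), float(parts[1])))
    return sorted(events)

# Hypothetical usage: compare an 8-thread run against a 16-thread run.
# run_a = merged_spikes("run8/*.dat")
# run_b = merged_spikes("run16/*.dat")
# same_work = (run_a == run_b)
```

Note that with thread-specific random number streams the merged lists will not be bit-identical, so an exact comparison like this only works if the randomness is held fixed.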
Ahoy! Sometimes crafting a detailed message to the author is all that is required to solve one's problem. While making the debug.log and looking at the first line of the output, I realized that I had forgotten to update the
Regarding
@heshpdx NEST creates a separate spike recorder instance for each thread to allow non-blocking recording, and each of these instances opens a separate file. Thus, the

I presume that with SPEC CPU you want to test CPU performance rather than I/O system performance. If this is indeed the case, I would suggest dropping the spike time recording to file. This would solve the problem with limited numbers of files.

If SPEC rules allow you to do one run with some form of recording of spikes and another run without it, one could use the following approach: Perform a run with spike recorders in which the spikes are recorded to memory. After the simulation time is up, extract spike data from the spike recorder in the SLI script and write it to file. This only requires one file per process instead of one per thread. Also, read out the spike counter in NEST. Then turn off spike recording completely and re-run the simulation. Read out the spike counter (it is always active) and you should get exactly the same number of spikes as when recording. I can send you code to do this.

When simulating the same network with different numbers of threads, you will obtain results that differ in detail, since the random number sequences in NEST are thread-specific. We have developed several measures to verify that simulations do produce statistically consistent results; see the paper by Gutzen et al. below. If you positively need identical results independent of the number of threads used, we can develop a work-around for that.

For strong-scaling experiments, I would encourage you to use the microcircuit model, as it is currently the most widely used network model for benchmarking; see the paper by Kurth et al. below. That paper also gives a rather recent description of the state of the art, although NEST 3.6 and later now outperform the NEST 2.14 that was used in that paper.

Concerning a single-thread benchmark, what are the constraints on running a single benchmark?
On my MacBook with Core i5, hpc_benchmark.sli takes about two minutes in total for a single run. If that is acceptable, I would suggest using hpc_benchmark.
And here the promised references:
You are correct that we want to be CPU bound and not I/O bound. But even with the current code, outputting a file per thread, we are quite CPU bound. Your spike-recorder-in-memory idea is very good. If you can coalesce that using SLI directives at the end of the simulation, that would solve our problem. The one limitation we have is that the run should fit within 64 GB of virtual memory (for the parallel runs). I imagine that is big enough. Please do work on that when you get a chance. I recognize that the NEST conference is coming up and you will be busy with that!

The thread count limitation was only seen in microcircuit.sli. I'll see if I can dig up why. Most OSes have a default open file descriptor limit of 4096, so I think the issue may be more esoteric.

I'll take a look at hpc_benchmark for single-threaded runs. I wanted to have a wide variety of code coverage and CPU behavior in three or four workloads. The time limit for single-threaded is 3-5 minutes on a "modern CPU" and it needs to fit within 1.8 GB of memory.
Regarding exact answers - we do allow some tolerance based on floating point rounding. If there is a lot of randomness, we can try to reduce that via more deterministic randomness. I've seen that in many benchmark candidates, less (or zero) randomness still provides valid answers and doesn't detract from creating a benchmark representative of the application behavior in the field. With my latest rebase I still get issues with running Potjans_2014/microcircuit.sli with more than 63 threads (in the SPEC CPU harness):
I have enough file descriptors:
I played around with the mainline nest build, and realized the issue is not with threads, but with
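For reference, the per-process descriptor limit can be inspected from inside the run itself with the standard library (Unix only), which helps confirm whether one-file-per-thread recording is actually hitting the limit:

```python
import resource

# Soft/hard limits on open file descriptors for this process.
# Each per-thread spike recorder that writes to file consumes one
# descriptor, so the soft limit bounds the usable thread count
# when recording to file.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft} hard={hard}")
```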
I had another look at

Concerning setting the number of threads, I am surprised that
M is defined via mpirun. Inside a NEST script you can either set T through the parameter

The set of files in

One more question: We are mostly using Python scripts in our benchmarking now. Would that be feasible for your benchmarks as well, or does that introduce too many confounding factors so that you would want to avoid Python?
Oh yes, I did change
Thanks for the explanations. I will create a version of

Two more questions concerning the SPEC CPU setup: Can you use thread-aware allocators such as jemalloc, and can you pin threads to cores? We have seen significant benefits from both in our benchmarks. In one case, we even found that we needed to change BIOS settings to ensure consistent pinning.
Thank you! Yes, these are the two main issues. Outputting all the data into a fixed number of files will solve the problem. Thread affinity is allowed, but that is done at a higher level in the harness, so the user must choose the same pinning scheme for all benchmarks. Custom memory allocators are also allowed. Some benchmarks show gains when linking with jemalloc or tcmalloc, others don't.
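For completeness, the kind of allocator/pinning setup discussed here typically looks like the following; this is an illustrative config fragment, not harness code, and the library path and core list are assumptions about a particular Linux system:

```shell
# Preload a thread-aware allocator (path is system-specific).
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2

# Pin OpenMP threads to cores; exact variables depend on the OpenMP runtime.
export OMP_NUM_THREADS=8
export OMP_PROC_BIND=close
export OMP_PLACES=cores

# Or pin the whole process externally:
taskset -c 0-7 nest microcircuit_spec.sli
```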
Hi @heshpdx! I have now created a modified version of

It is monolithic, i.e., you need only this one SLI file. Parameters can be passed on the command line like this (all optional):

nest --userargs=threads=8:scale=0.4:seed=12345 microcircuit_spec.sli

It does not write any spikes to files, but writes a short report at the end to
I hope this is suitable for SPEC use. With scale 0.4, it should be reasonable for a single thread, with scale 1.0 for larger numbers of threads. As I mentioned earlier, NEST benefits from jemalloc/tcmalloc and similar, and from thread pinning.
Thanks! I will try this out. For the verification to be scalable, we will change the output to look like the following (which might require commenting out some of the source):
Basically, cut out any run-specific information such as thread count and time. Then we can diff this output while applying a tolerance to the spike numbers (if we need to). Would it be easy to do this for hpc_benchmark too? I will take a look tomorrow.
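The tolerance-based diff described here could be sketched as follows, assuming the report consists of `Label: value` lines as in the sample output; numeric fields are compared within a relative tolerance, everything else exactly:

```python
def reports_match(text_a, text_b, rel_tol=0.03):
    """Compare two simulation reports line by line.

    Numeric values (e.g. spike counts) must agree within rel_tol;
    non-numeric fields must match exactly. Assumes "Label: value" lines.
    """
    lines_a = [l for l in text_a.splitlines() if l.strip()]
    lines_b = [l for l in text_b.splitlines() if l.strip()]
    if len(lines_a) != len(lines_b):
        return False
    for a, b in zip(lines_a, lines_b):
        key_a, _, val_a = a.partition(":")
        key_b, _, val_b = b.partition(":")
        if key_a.strip() != key_b.strip():
            return False
        try:
            x, y = float(val_a), float(val_b)
            if abs(x - y) > rel_tol * max(abs(x), abs(y), 1.0):
                return False
        except ValueError:
            if val_a.strip() != val_b.strip():
                return False
    return True
```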
To change the output, you only need to comment out or remove a few lines at the end of the script. The "Population spikes" line is only there as a double check; "Total spikes" should suffice to check that the simulation is running correctly. "Network size" and "Num connections" depend on scale, but not on the number of threads or on the random seed.

For actual benchmark runs, you may want to pass "record=false" as a userarg. Then no spike recorders are created and you will only get the "Total spikes", not the "Population spikes". I just pushed another update to the script so that, if recording, it always records from all neurons, removing a potential source of confusion.

It should be rather straightforward to apply the same user-argument handling (at the beginning of the script) and reporting (at the end) to hpc_benchmark. The main difference is that hpc_benchmark has only two neuron populations, E(xcitatory) and I(nhibitory). Let me know if you run into problems!
Thank you, this looks good. If I run it multiple times with the same cmdline, I get the same results. When I run it with varying number of threads, the Total spikes and Population spikes aren't exact but they are close. What kind of delta would you expect? Here is some data using
It appears that the number of local nodes grows as I add more threads. My machine maxes out at 160 hardware cores, and that configuration shows the largest number of local nodes, although the "network size" is invariant across all these runs at 23163. Is this ok? Are any of the output values "out of bounds" in your opinion? It looks like we need a tolerance of about 3% for the spike counts and sum.

Is there a way to output just the "Simulation Report" to a separate file instead of stdout? That way I can also avoid lines like

Also, I diffed this script against the microcircuit.sli in your mainline. There are a lot of changes! I'm wondering, could you apply the same kind of changes to hpc_benchmark? Humbly speaking, it would be a lot faster for you to do it correctly than for me to learn it.

Finally, I have been using BrodyHopfield.sli and brunel_ps.sli as the "test" and "train" workloads, which are meant to be very small and kind of small, respectively. Can you help print the same kind of simulation report at the end of those tests to facilitate verification? I tried copying the stanzas from microcircuit.sli, but I am not sure which variables these simulations know about. (Alternatively, I can scale down microcircuit.sli to also be the test workload, but I would like to have some more code coverage, and not have the "training" workload be exactly the same as the reference benchmark.) Thank you!
@heshpdx I have just pushed a new version of
I pulled your code and started playing around with the scripts. I also see hpc_benchmark and cuba_stdp have spec versions, thank you. I had to make this small change to hpc to avoid a syntax error.
I'll try out some values for scale so that the simulation runtime fits within our time and memory budget! |
I played around with microcircuit, cuba_stdp, and hpc_benchmark. The first two are resilient to various system changes and can verify within a 3% delta. hpc_benchmark gives a drastically different answer between GCC and LLVM. I ran the models with 80 threads and a Scale which requires about 15 GB of memory. About 3 minutes of runtime gives:
Do you have insight into why LLVM/clang is so different? We have gone through the source and removed all unstable algorithms and other sources of differences we know of. For example, we replaced std::sort with std::stable_sort, replaced the uniform_distribution/poisson_distribution calls with the LLVM versions as compiled code, and replaced all calls to std::rand with our own "specrand" generator. Usually this does the trick, but there is something else afoot. I would love to get hpc_benchmark working since you mentioned above it is the best one. But I am glad that we have cuba_stdp and microcircuit working at least. |
Some more datapoints, first one from
Due to the plasticity in the network, simulations of hpc_benchmark can take very different trajectories, especially if simulated for more than 300 ms. This can lead to wildly differing firing rates, which are generally far higher than biologically plausible rates (around 10 sp/s).

To avoid this problem, I have just pushed a version of hpc_benchmark in which the learning rate is set to zero by default, so that synaptic weights remain constant. The simulation still goes through the mechanics of the spike-timing-dependent plasticity mechanism, but the dynamics will now stay much more stable. Simulation times should also come down noticeably, since firing rates remain reasonable. Furthermore, variations for different seeds/thread numbers/compilers should now be only a few percent.

The variation in rates which you observed between GCC and Clang and between different thread numbers is most likely due to different random number sequences being generated (Poisson random variates). To test whether differences you observe between different thread numbers/compilers are reasonable, you should run simulations with five different random seeds and note the variation in spike numbers you get. Changing the number of threads or the compiler should lead to variations in spike numbers comparable to what you see for different seeds.
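The seed-sensitivity check suggested above is easy to automate: run the model with several seeds, collect the total spike counts, and summarize their spread. A sketch (the seed loop is hypothetical; only the spread calculation is concrete):

```python
from statistics import mean, stdev

def relative_spread(counts):
    """Coefficient of variation of total spike counts across seed runs.

    A run with a different thread count or compiler is plausible if
    its count falls within this seed-to-seed variation.
    """
    m = mean(counts)
    return stdev(counts) / m if m else float("inf")

# Hypothetical usage with counts collected from five seeds:
# counts = [total_spikes_for_seed(s) for s in range(1, 6)]
# print(f"seed-to-seed variation: {relative_spread(counts):.1%}")
```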
Great, thank you. I had increased the simulation time because it seemed that the majority of the runtime was in the setup phase when the network is being constructed (especially for larger Scale networks), and I wanted to make sure we are benchmarking the simulation as well. |
The new hpc script works well. I had to increase tolerance from 3% to 4% to empirically allow results from 160-thread runs that looked ok otherwise. I did notice that the rng_seed value printed in the report is always 1. Maybe it is printing a bool instead of the value? |
4% tolerance sounds perfectly fine. I am surprised that you always get 1 for the rng_seed. In my tests, I see the seed value that I pass as
My apologies, the seed issue is my own fault! I had changed the source last year when I ingested the code into the harness, to peg all runs with the same seed. Let me think if I want to open that up again, or keep it fixed for all time.
I see your point of fixing the seed for consistency in benchmarks. But NEST has a fixed default seed already, see
The advantage of a configurable seed is that you can check for the range of variation in simulation results for different random number sequences.
Yes, I agree. I reverted my change so as to regain the flexibility. Things are running well now. I will let you know how it goes!
@heshpdx Could you contact me by email ([email protected])?
Hello friends,
I’m a CPU architect at Ampere Computing where I do performance analysis and workload characterization. I also serve on the SPEC CPU committee, working on benchmarks for the next version of SPEC CPU, CPUv8. We try to find computationally intensive workloads in diverse fields, to help measure performance across a wide variety of behaviors and application domains. Based on the longevity of nest, its large active community in biology, and its use in education, I have proposed that the nest neural network model be included in the next set of marquee benchmarks in SPEC CPU.
As part of the effort, we have ported and integrated the nest mainline code into the SPEC CPU harness so that it can be tested on a wide variety of systems in a controlled environment to produce reproducible results. We have even built it on native Windows using MSVC and the Intel compiler for Windows – we are happy to share the changes if someone is interested in testing and integrating it back into the upstream mainline for the benefit of the community.
The piece we need help with is an understanding of the multithreaded workloads. Right now, we have single-threaded nest command lines which run and produce verifiable output across many compilers (llvm, gcc, icc, aocc, nvhpc, cray), ISAs (aarch64, x86, power) and operating systems (linux, windows, android). We verify the run by checking the .dat files which come out of the simulation runs to make sure that there are no differences in the resulting output. A problem arises when we run with multiple threads, since a different number of files is produced, and I am unfamiliar with how to coalesce them for verification.
First some fundamental questions: Does a nest invocation with 8 threads perform the same amount of work as a run with 16 threads? Or is it that the problem being solved is larger? If it is the same, how can we verify that? Does this answer change based on the .sli script used?
In the example below, if I run
examples/nest/brunel-2000_newconnect_dc.sli
(with a small edit to make it run longer)... I tried with 8 threads and 16. It looks like I am simulating the same number of Neurons and Synapses. The 8-thread version outputs 16 files and the 16-thread version outputs 24 files. The total lines in the files are close. Do they fundamentally contain the same information, just at different sample points?

Overall, the goal is to be able to verify that the same amount of work was completed between these two command lines, and verify that they calculated the same result. This allows a benchmark to run on systems with a varying number of hardware cores, so we can measure CPU performance between them. We are allowed to provide some tolerance, in case there is floating-point rounding error.
For the multithreaded benchmark, I am exercising the scripts below. The goal is to showcase scalable threading performance, as well as cover a variety of behaviors in the nest simulator.
If you have feedback on which are more or less useful as multithreaded benchmarks, please share your thoughts!
Thank you!