Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Makefile updates #198

Merged
merged 9 commits into from
Apr 7, 2020
Merged

Makefile updates #198

merged 9 commits into from
Apr 7, 2020

Conversation

manodeep
Copy link
Contributor

@manodeep manodeep commented Apr 6, 2020

Thanks for releasing the code-base. I took a quick look at the Makefile and fixed a couple of issues.

  • Optimisation flags should be on the compile-time and not link-time. I moved the -O3 flag to the compile stage. Plus, the -march=k8 was only targeting the AMD K8 architecture that was released in ~2003-2004 (SSE2 is the "fastest" instructions used on K8, see details below). Modern CPUs have more advanced instructions, and the generic -march=native will optimise for the CPU that COMPAS is being compiled on. (If you are compiling on OzSTAR, then I would recommend using -march=skylake-avx512 instead of -march=native). You guys should re-run any benchmarks and check if there are any speedups.

  • Duplicate CFLAGS on two different lines containing -g and the include options. Replaced with the more appropriate CXXFLAGS for C++ code

  • Replaced all hard-coded COMPAS with a single-sourced variable ($EXE)

  • Embedded the path to the BOOST library into the executable by using the -Xlinker -rpath -Xlinker </path/to/boost/lib>. This will allow the executable to always find the BOOST library used at compile-time (regardless of whether that library is loaded at run-time) -- I suspect this will remove the need for the static linking

  • Numerous other small-updates to the Makefile structure, including removing of include flags at link-time

  • I have also attempted to line-up trailing backslash on the SOURCES lines.

Comparison of instruction-sets enabled by compile-time options.

Here are the instruction sets enabled by -march=k8 and -march=skylake-avx512, checked with g++-7.5.0 on my OSX laptop:

[~/temp/COMPAS/src @Manodeeps-MacBook-Pro] g++ -v
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/opt/local/libexec/gcc/x86_64-apple-darwin18/7.5.0/lto-wrapper
Target: x86_64-apple-darwin18
Configured with:
/opt/local/var/macports/build/_opt_bblocal_var_buildworker_ports_build_ports_lang_gcc7/gcc7/work/gcc-7.5.0/configure
--prefix=/opt/local --build=x86_64-apple-darwin18
--enable-languages=c,c++,objc,obj-c++,lto,fortran --libdir=/opt/local/lib/gcc7
--includedir=/opt/local/include/gcc7 --infodir=/opt/local/share/info
--mandir=/opt/local/share/man --datarootdir=/opt/local/share/gcc-7
--with-local-prefix=/opt/local --with-system-zlib --disable-nls
--program-suffix=-mp-7 --with-gxx-include-dir=/opt/local/include/gcc7/c++/
--with-gmp=/opt/local --with-mpfr=/opt/local --with-mpc=/opt/local
--with-isl=/opt/local --enable-stage1-checking --disable-multilib --enable-lto
--enable-libstdcxx-time --with-build-config=bootstrap-debug
--with-as=/opt/local/bin/as --with-ld=/opt/local/bin/ld
--with-ar=/opt/local/bin/ar --with-bugurl=https://trac.macports.org/newticket
--disable-tls --with-pkgversion='MacPorts gcc7 7.5.0_0'
--with-sysroot=/Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk
Thread model: posix
gcc version 7.5.0 (MacPorts gcc7 7.5.0_0)

[~/temp/COMPAS/src @Manodeeps-MacBook-Pro] g++ -dM -E -x c++ -std=c++11 -O3
-march=k8  - < /dev/null | grep -i -e sse -e avx
#define __cpp_static_assert 200410
#define __SSE2_MATH__ 1
#define __SSE2__ 1
#define __SSE__ 1
#define __SSE_MATH__ 1

[~/temp/COMPAS/src @Manodeeps-MacBook-Pro] g++ -dM -E -x c++ -std=c++11 -O3
-march=skylake-avx512  - < /dev/null | grep -i -e sse -e avx
#define __AVX512F__ 1
#define __cpp_static_assert 200410
#define __SSE4_1__ 1
#define __core_avx2__ 1
#define __SSE4_2__ 1
#define __AVX512BW__ 1
#define __SSE2_MATH__ 1
#define __AVX__ 1
#define __AVX512DQ__ 1
#define __AVX512VL__ 1
#define __AVX512CD__ 1
#define __AVX2__ 1
#define __tune_core_avx2__ 1
#define __SSE2__ 1
#define __SSSE3__ 1
#define __core_avx2 1
#define __SSE__ 1

@SimonStevenson
Copy link
Collaborator

This should be a pull request into dev not master

@manodeep manodeep changed the base branch from master to dev April 6, 2020 02:29
@manodeep
Copy link
Contributor Author

manodeep commented Apr 6, 2020

@SimonStevenson Oops - fixed now

@SimonStevenson SimonStevenson self-requested a review April 6, 2020 02:35
@SimonStevenson
Copy link
Collaborator

This compiled fine for me, would like to get @jeffriley 's eyes on it as well if possible

@SimonStevenson
Copy link
Collaborator

time make -j $(nproc)
real 7m39.157s
user 24m9.879s
sys 0m37.284s

@jeffriley
Copy link
Collaborator

The changes look fine. Happy somebody is looking at the makefile - I've basically been ignoring it. I'll officially review a bit later.

@SimonStevenson
Copy link
Collaborator

running ./COMPAS -n 1000 on master and this branch
master:
Clock time = 160.132 CPU seconds
Wall time = 0:2:45 (hh:mm:ss)

makefile branch
Clock time = 58.9174 CPU seconds
Wall time = 0:1:3 (hh:mm:ss)

So this leads to a factor of >2 speed up.

The outputs are however not identical, I need to check that this is expected, and that they produce identical outputs for identical inputs.

@SimonStevenson
Copy link
Collaborator

SimonStevenson commented Apr 6, 2020

OK spotted a possible bug in default random seed (issue #199 ). Fixing the seed to be 42:
./COMPAS -n 1000 --random-seed 42

I now get identical outputs, and these are the run times:

master
Clock time = 166.763 CPU seconds
Wall time = 0:2:51 (hh:mm:ss)

makefile
Clock time = 59.2974 CPU seconds
Wall time = 0:1:1 (hh:mm:ss)

@manodeep
Copy link
Contributor Author

manodeep commented Apr 6, 2020

@SimonStevenson Given that -Ofast is not enabled, the likely culprit is the fused-multiply-add operations (e.g., a*y + b becomes one operation rather than 2). If you recompile with -mno-fma and re-run, then the differences might disappear.

Never-mind - fixed :)

@SimonStevenson
Copy link
Collaborator

This is awesome @manodeep !

@manodeep
Copy link
Contributor Author

manodeep commented Apr 6, 2020

As a point of reference, SSE2 operates on 128 bit vector registers (4x floats, 2x doubles) and AVX-512 operates on 512-bit vector registers (16x floats, 8x doubles) - so the best case speedup would be 4x.

SimonStevenson
SimonStevenson previously approved these changes Apr 6, 2020
Copy link
Collaborator

@SimonStevenson SimonStevenson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have checked that this compiles and runs and produces the same binaries as master. I found that for a small test run this leads to a speed up of >2! I am happy signing off on this.

@SimonStevenson
Copy link
Collaborator

I just approved this, but @jeffriley pointed out that we should document Manodeep's suggestion
If you are compiling on OzSTAR, then I would recommend using -march=skylake-avx512 instead of -march=native somewhere. Is it possible to add a flag to the makefile to specify this?

@ilyamandel
Copy link
Collaborator

"time make -j $(nproc)
real 7m39.157s
user 24m9.879s
sys 0m37.284s"

Query: does it make sense to have two compilation options, one that can compile quickly (for quick tests) at the cost of less efficient running, and another that optimises runtime for longer runs at the cost of long compilation times?

@ilyamandel
Copy link
Collaborator

Thank you very much for your advice and help, @manodeep -- we really appreciate it!

@manodeep
Copy link
Contributor Author

manodeep commented Apr 6, 2020

I just approved this, but @jeffriley pointed out that we should document Manodeep's suggestion
If you are compiling on OzSTAR, then I would recommend using -march=skylake-avx512 instead of -march=native somewhere. Is it possible to add a flag to the makefile to specify this?

Btw, the reason why I recommended -march=skylake-avx512 is because gcc < 7 seems not fully enable the available instructions with -march=native (see here). Given that you only see a 2x speedup, just leaving -march=native on by default everywhere is probably fine.

If you guys haven't done so already, I would recommend using Intel Vtune to visually profile the code and see if there are any easy gains. Intel Vtune is available as a module on OzSTAR, just make sure to compile with icpc and with -xHost. Instructions on how to run Vtune is on the Cookies-n-Code page.

@SimonStevenson Happy to chat via slack if that would be easier.

@manodeep
Copy link
Contributor Author

manodeep commented Apr 6, 2020

@ilyamandel Not a problem - glad I could help!

@jeffriley
Copy link
Collaborator

When I compile with this makefile, using gcc version

gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0

I get the following warning for every source file:

utils.cpp: In function ‘void __static_initialization_and_destruction_0(int, int)’:
utils.cpp:501:1: note: variable tracking size limit exceeded with -fvar-tracking-assignments, retrying without
 }

If I add

--param=max-vartrack-size=0

to CXXFLAGS the warning goes away

@manodeep
Copy link
Contributor Author

manodeep commented Apr 6, 2020

@jeffriley Good solve - that error message was annoying. I wonder if your solution is identical to -fno-var-tracking-assignments

@jeffriley
Copy link
Collaborator

@manodeep I think -fno-var-tracking-assignments turns assignment tracking off, whereas --param=max-vartrack-size=0 changes the default limit from 10000 (I think) to unlimited. I'm not sure we need it, so we could just use -fno-var-tracking-assignments

@manodeep
Copy link
Contributor Author

manodeep commented Apr 6, 2020

From a quick googling (earlier), that option seems to facilitate debugging. May be worth removing from production runs, particularly so if the compile times change depending on that parameter settiing.

@manodeep
Copy link
Contributor Author

manodeep commented Apr 7, 2020

@SimonStevenson @jeffriley I made a few more changes that incorporate good Makefile practices.

Should not affect the build, but now has better dependency checker. For instance, if the header files are altered, then the associated object file will be rebuilt. Same for all object files, if the Makefile is updated

I am done with the edits. Any other edits will be in response to code-review from you guys.

@SimonStevenson
Copy link
Collaborator

SimonStevenson commented Apr 7, 2020

Thanks @manodeep , I'll rerun my tests. Can you respond to @ilyamandel 's query above regarding optimising for compilation time vs optimising for runtime?

Query: does it make sense to have two compilation options, one that can compile quickly (for quick tests) at the cost of less efficient running, and another that optimises runtime for longer runs at the cost of long compilation times?

@SimonStevenson
Copy link
Collaborator

Compiles and runs fine for me!

manodeep/makefile
./COMPAS -n 1000 --random-seed 42
Clock time = 55.4875 CPU seconds
Wall time = 0:0:56 (hh:mm:ss)

@manodeep
Copy link
Contributor Author

manodeep commented Apr 7, 2020

@SimonStevenson Ohh I didn't realise that query was directed at me! Yup - totally possible to create two compilation pathways, but that requires a choice about usability. How do you guys envision running these two different compilation tracks? I have implemented one solution, where the non-optimised build is the default (i.e., mirrors the setup prior to this PR). Running make fast or make staticfast creates the optimised builds (faster runtime, slower compile time), whereas simply typing make or make static creates the non-optimised (slower runtime, faster compile time).

@SimonStevenson When you re-ran the code just now - was the output still identical to before?

@SimonStevenson SimonStevenson dismissed their stale review April 7, 2020 01:20

Substantial changes since approved, should be reapproved

@ilyamandel
Copy link
Collaborator

@manodeep -- thank you, exactly what I had in mind!

@SimonStevenson
Copy link
Collaborator

Hi @manodeep , thanks for this!

Compiling with
time make -j $(nproc)

real 3m6.504s
user 8m9.177s
sys 0m28.017s

Compiling with
time make fast -j $(nproc)

real 7m10.527s
user 23m42.493s
sys 0m34.970s

Checked that it runs and produces the same output in each case using
./COMPAS -n 1000 --random-seed 42

When you re-ran the code just now - was the output still identical to before?
Yes

Copy link
Collaborator

@SimonStevenson SimonStevenson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have checked that the code compiles using both
make
which compiles faster and runs slower
and
make fast
which compiles slower and runs faster
and produces the same output in each case.

@SimonStevenson SimonStevenson merged commit 8a298af into TeamCOMPAS:dev Apr 7, 2020
@manodeep manodeep deleted the makefile branch April 7, 2020 02:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants