3rd EasyBuild hackathon meeting minutes day 1
(Monday Mar. 11th 2013, 10am-6pm)
The first day of the 3rd EasyBuild hackathon consisted of presentations, discussions and initial hands-on experience with EasyBuild for attendees new to the tool. These notes were mainly taken by Kenneth and Jens, with contributions by Fotis.
- Kenneth Hoste (HPC-UGent, EasyBuild developer and release manager)
- Jens Timmerman (HPC-UGent, EasyBuild developer)
- Fotis Georgatos (University of Luxembourg, HPC sysadmin and active contributor)
- Jens Wiegand (The Cyprus Institute, LinkSCEEM project manager)
- Thekla Loizou (The Cyprus Institute, HPC user support)
- George Tsouloupas (The Cyprus Institute, HPC sysadmin and user support)
- Stelios Erotokritou (The Cyprus Institute, HPC user support/PRACE)
- Mohamed Gafaar (Bibliotheca Alexandrina, HPC sysadmin/user support)
- Dina Mahmoud Ibrahim (Cairo University, HPC sysadmin/user support)
- Alan O'Cais (Jülich Supercomputing Centre, HPC user support & LinkSCEEM)
- Alexander Schnurpfeil (Jülich Supercomputing Centre, HPC user support)
- Nicolas Kanaris (The Cyprus Institute, HPC user (OpenFOAM))
- George Fanourgakis (The Cyprus Institute, HPC user, molecular dynamics)
- Demetris Charalambous (Cyprus Meteorological Service, HPC user support?, weather forecasting (WRF, ...))
- Ioanna Kalvari (University of Cyprus (bioinformatics), HPC user)
- Ioannis Kirmitzoglou (University of Cyprus (bioinformatics), HPC user)
- Adam DeConinck (NVIDIA Corporation, HPC sysadmin) [remote via Skype]
- [10am-10.10am] presentation on LinkSCEEM project
- [10.10am-10.15am] introduction round: who's who?
- [10.15am-12pm] presentation on EasyBuild: Building Software With Ease
- [1.30pm - 1.45pm] presentation by Jülich Supercomputing Centre (JSC) on current activities and plans with EasyBuild
- [1.45pm - 2.30pm] presentation by Cyprus Institute on current activities and plans with EasyBuild
- [2.30pm - 5pm] discussions + initial hands-on experience with EasyBuild
- [5pm - 6pm] presentation by NVIDIA: Introduction to the CUDA Toolkit for Building Applications
- [6pm - 8pm] aftermath: discussions w.r.t. CUDA support in EasyBuild
- goal of LinkSCEEM project: establish HPC ecosystem in the Eastern Mediterranean
- resources, training, expertise, connectivity, ...
- online training content is very important!
- with support from NCSA, Jülich, ..
- www.isgtw.org
- CSC2013 PRACE conference in Cyprus
- Alan's subject: performance analysis and optimisation of community codes
- `--download-only` command line option is missing (see framework#538); can be done indirectly for now (see the sketch below):
  - using `--stop fetch`, but this will fail after the first download failure
  - using `--regtest`, and 'breaking' `--job` so no jobs are submitted for the builds
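A minimal sketch of the `--stop fetch` workaround mentioned above, assuming a hypothetical easyconfig file name; this fetches the sources of the easyconfig (and of its dependencies, with `--robot`) without building, but aborts as soon as one download fails:

```
eb WRF-3.4.1-goolf-1.4.10-dmpar.eb --robot --stop=fetch
```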
- EasyBuild bootstrap script [question]
- why not include a fixed Python 2.7 version in the bootstrap procedure, so we have full control over the Python version being used
- add Python module as a dependency for EasyBuild
- why does something have to be part of a toolchain, and not just a dependency? [question]
- can toolchain be extended dynamically?
- e.g. include dependencies in toolchain as well?
- `ictce` => `ictcee` (`e` for extended)?
- creating yet another toolchain means that the whole stack of dependencies needs to be rebuilt, which is a pain [question]
- and results in further explosion of the set of available modules
- create big fat toolchains and filter stuff out in `toolchainopts` with a `filter` option? (a sketch of current `toolchainopts` usage follows below)
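For reference, a hedged sketch of what `toolchainopts` looks like in an easyconfig today; the package name and version are made up, `usempi` and `opt` are existing keys, and the proposed `filter` key does not exist yet:

```
# fragment of a hypothetical easyconfig (shown here as a shell heredoc append)
cat >> Example-1.0-ictce-4.1.13.eb << 'EOF'
toolchainopts = {'usempi': True, 'opt': True}
EOF
```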
- are existing modules reused if they're needed? [answer: yes]
- supporting alternative module naming schemes (see framework#173)
- basically just provide multiple alternative views on the existing modules
- let's escape from the "one size fits all" concept
- flat (Cray-optimal) or hierarchical (Lmod-optimal), all can be valid
- hierarchical can be top-down (compiler->libraries->apps) or vice versa (software on top, that is what the users care about)
- setting up mirrors for sources
- initial mirror prototype already in development/use at Uni.Lu; ~36GB of software (open source & closed)
- the split between redistributable and non-redistributable sources is unavoidable for a public service; zsync could help with either
- `--try-amend` source URL should be supported via EasyBuild configuration file (see framework#462)
  - currently already supported via `EASYBUILD_TRY_AMEND` env var
- Trilinos and other such multi-dep libs should be in bold on the slide with supported software
- can `build_in_install_dir` be specified in an easyconfig file? It might reduce the need for easyblocks for bioinfo packages (see framework#539)
- `ictce/3.2.2.u3` toolchain sources are no longer available, so use another toolchain in examples (WRF)
  - also there are bugs in icc/11.1.07* & flexlm, which are time-consuming to debug (EB processes are affected)
- chroot/jail into installation prefix, as a proper containment solution when building software: (see framework#540)
- it makes it much safer to build 1000s of packages coming from pkgsrc (at least, at build time!)
- it will allow catching `osdependencies` that now escape unnoticed
- it seems to be the correct thing to do also in relation to hashdist
- Fotis: no need to build safety features inside easyconfigs, IMHO, that has no clear benefit
- possible directions on how to modify the current namespace (e.g. think of lower-case modules):
- customize modulefiles with sed and change what is needed (can be tricky or risky to be correct)
- manually cultivate a symlink farm (only good for the first load, deps will be like before)
- can EasyBuild cope with existing modules (e.g., OpenMPI), or do they need to be rebuilt?
- will not work out of the box, because those modules will be missing things like EBROOT/EBVERSION variables (see the quick check below)
- rebuilding them is the best idea, and will enable you to roll out software again after reinstall of system with different OS
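A quick check, assuming a hypothetical module name, of whether an existing module was generated by EasyBuild: EasyBuild-generated modules define `EBROOT*`/`EBVERSION*` variables, hand-written ones usually do not:

```
# 'module show' writes to stderr, hence the redirect before grepping
module show OpenMPI/1.6.4-GCC-4.7.2 2>&1 | grep -E 'EBROOT|EBVERSION'
```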
- adding support for BlueGene Q system
- perfect timeframe since it's quite new
- specific characteristics: cross-compilation, IBM XLC compiler, ...
- on BlueGene systems, running of tests will need to be skipped or done differently (remotely)
- skipping can be done with e.g. `--try-amend=skipsteps=test,test_cases` (see the sketch below)
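A minimal sketch of that workaround; the easyconfig name is hypothetical, and the step names are taken verbatim from the note above (they may differ per EasyBuild version):

```
# skip the test-related steps when they cannot run on the build host
eb WRF-3.4.1-goolf-1.4.10-dmpar.eb --robot --try-amend=skipsteps=test,test_cases
```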
- `OpenFOAM` is a pain because of large differences in system characteristics
- error/warning log parser now spits out lots of false positives (see framework#541)
- regular expression used needs to be documented well
- need to enhance regex to reduce amount of false positives
- `bbcp` can never work if the required ports are not open
  - add a test case for this?
  - support a way of spitting out a warning about this at the end of the installation (see framework#542)
- configuration file (and .eb files) will not be executed anymore => needs to be followed up (w/ Stelios)
- more stuff needs to be shoved into toolchains: Python (George T.), zlib, ...
- document where to put source files (see wiki:Configuration)
- supercomputing training portal: http://linksceem.eu/ls2/component/content/article/198
- big fat training cluster via VMs w/ terminal emulation in web browser
- rely on EasyBuild to get software installed on this
- should function well in a low-bandwidth environment (important for LinkSCEEM)
- documentation portal w.r.t. supercomputers, just collect links to useful documentation around the web
- try and get a toolchain working for BlueGene Q systems
- categories of software used in PRACE
N.B. This presentation also incorporates the 3rd-party needs of Uni.Lu
- strong commitment from Cyprus Institute to EasyBuild
- was really useful to quickly set up a software stack
- GPU software, even though not all apps were there
- useful for conformity across LinkSCEEM institutions - very important
- also for setting up post-processing nodes (w/ different OS)
- robot should not depend on filename, but on contents of easyconfig
- need support for multiple source paths (and dependencies? => Jens T.)
- `--download-only` (from mirrors), needed in relation to jumpstart procedures
- document jail tool (and add it to bootstrap)
- goolf: OpenMPI >=1.5, OpenBLAS, FFTW w/ `--enable-avx`
- custom variables in module files (see `modextravars`)
- CUDA support: in toolchain or not?
- different CUDA versions, dependency chain, ...
- PGI toolchain is also important for CyI
- OFED vs no-OFED: easy way to get rid of the `-no-OFED` version suffix? via EB config file?
- user environment: multiple source paths, custom version suffix for 'tagging' your own builds
- FFTW single/double precision
- separate module (and thus separate toolchains) vs 'fat' build
- supporting both single and double precision in a single FFTW module requires running configure/make/make install twice; doable? Yes (see the sketch below).
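A rough sketch, assuming a standard FFTW 3.3.x source tree and a made-up prefix, of what such a combined build would do (this is what an easyblock would script, not EasyBuild's actual implementation):

```
# double precision (the default), then single precision into the same prefix
./configure --prefix=$HOME/apps/FFTW/3.3.3 --enable-shared
make && make install
make distclean
./configure --prefix=$HOME/apps/FFTW/3.3.3 --enable-shared --enable-single
make && make install
```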
- local climate group requirements: Ferret, ... (see George F.)
- Python as a part of the toolchain?
- managing multiple EB versions
- need guidelines and best practices for EasyBuild
- what should sites customize? what should be left untouched (e.g. because it'll break in the future with new EB versions)
- how to override the default use of `/tmp` (set `$TMPDIR`, which will be picked up by Python; see the example below)
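A hedged example (paths and easyconfig name are made up): Python's `tempfile` module honours `$TMPDIR`, so EasyBuild's temporary build directories follow it too:

```
export TMPDIR=/scratch/$USER/eb-tmp
mkdir -p "$TMPDIR"
eb zlib-1.2.7-goolf-1.4.10.eb --robot
```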
- explosion of available modules
- we need a good way to handle that
- in combination with flexible module naming scheme?
- figure out exact build options that were used
- document querying log for e.g. configure options
- or provide tools for it (via `eb`)
- will be useful in the future as part of any kind of `EasyDoc` activity, to post information on a website
- issue of OS dependencies: portable way of specifying them
- making sure we catch all dependencies (jail tool provided by HashDist)
- `goalf`/`ictce` versioning schemes need to be documented
  - e.g. add `--enable-avx` to FFTW but keep the toolchain version the same?
- sidenote: `--enable-avx` with FFTW apparently is suboptimal for GROMACS, up to 20% perf loss (!)
  - see http://www.gromacs.org/Documentation/Installation_Instructions#3.2.1._Running_in_parallel
- will need to bump ATLAS anyway (because of Sandy Bridge support), hence `goalf` as well
- keeping versions fixed while tweaking builds is not a good idea
- OpenMPI version: let's bump to v1.5, or later
- ABI is guaranteed to be compatible from that version onwards (i.e. the OpenMPI module version can be changed on the fly, without needing to rebuild code)
- part of new goalf (v2.x?)
- GROMACS on top of new `goalf` (`goolf`)
- (Fotis took over for the remainder)
- PRACE production environment
- largely done; mainly a set of environment variables is missing (JT: could use `modextravars` for that)
- EasyBuild enables you to set up a PRACE environment in user space (huge selling point: you only need GCC/EnvMod/Py)
- support for custom site-specific environment variables
- e.g. *_LICENSE_FILE, DEBUGGERS environment and so on
- see `modextravars` stated above (a hypothetical fragment is sketched below)
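A hypothetical fragment illustrating `modextravars` for site-specific variables; the easyconfig name, variable names and paths are all made up:

```
# appended to a hypothetical easyconfig (shown as a shell heredoc)
cat >> Example-1.0-ictce-4.1.13.eb << 'EOF'
modextravars = {
    'EXAMPLE_LICENSE_FILE': '/opt/licenses/example.lic',
    'EXAMPLE_DATA_DIR': '/work/projects/example/data',
}
EOF
```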
- HPCBIOS pitch: i.e. a standardization effort, collection of policies; see http://hpcbios.readthedocs.org/en/latest/
- re-usable high-level documentation (intention is to attach it as .pdf to Users' Welcome Letters)
- provide "standard" working environment for e.g. climate science
- [Alan] documentation for building CUDA applications provided by NVIDIA is very useful and hard to come by!
- NVIDIA CUDA with OpenMPI: K20 + Mellanox IB
- 'drop-in' libraries: cuBLAS
- actually a misnomer, since it provides different function names
- CUDA (C)
- compilers + tools
- `nvcc` should always be used, unlike `mpicc` which is optional (can handle linking with MPI libs yourself)
- runtime API (`libcudart.so`)
- IDE, visual profiler
- collection of libraries (CUBLAS, CUFFT, Thrust, ...)
- quite similar to e.g. MPI (?)
- usually there's a configure option like `--enable-cuda`, but no standard
- similar for installation prefix for CUDA, e.g. `CUDA_HOME` (quite similar across apps); see the sketch below
- some apps ship their own runtime
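A hedged sketch of the pattern: there is no standard, but a typical autotools-based application might be pointed at an existing CUDA toolkit roughly like this (the exact flag names vary per application, and the paths are made up):

```
export CUDA_HOME=/usr/local/cuda-5.0
./configure --prefix=$HOME/apps/example --enable-cuda --with-cuda=$CUDA_HOME
```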
- be careful with setting `LD_LIBRARY_PATH` in a CUDA context
- `nvcc` treats C code like C++ (!)
- `-use-fast-math` mostly targets single-precision stuff
  - can be broken up into parts
- `-Xptxas=iv` can always be set for having debug info
- most MPI implementations support CUDA (except for Intel MPI)
- things may break for older versions
- significant performance gain if communication layer supports DMA to GPU memory (GPU Direct, requires IB QDR/FDR and Mellanox)
- FG sidenote: i.e. MVAPICH + CUDA is likely the preferred testing ground
- test examples: `matrixMul`, `simpleMPI`
- CMake 2.8.7+ for good CUDA support
- object code for correct device architecture should be used for best performance
- make sure PTX is not being used, which is JIT-compiled and thus leads to startup performance loss
- default is to target the oldest PTX/object code: `--gencode` defaults to `compute_10,code=sm_10`
- usually something is set in the application Makefile, e.g. `OPENCCFLAGS=gencode...`, `PTXARGS=--arch,...` (see the sketch below)
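A hedged sketch, using the `matrixMul` sample mentioned above as the source file: generate object code for the device architectures actually present instead of relying on the old defaults; each `-gencode` pair emits `sm_XX` object code, and adding a `code=compute_XX` entry would embed PTX that gets JIT-compiled on newer GPUs (with the startup cost mentioned above):

```
nvcc -O2 \
  -gencode arch=compute_20,code=sm_20 \
  -gencode arch=compute_35,code=sm_35 \
  -o matrixMul matrixMul.cu
```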
- can CUDA toolkit be queried for what options are optimal for current device architecture?
- possible options for `gencode` depend on CUDA compiler version
- build for "everything under the sun"
  - will fail on older `nvcc` systems
  - larger binary
- OpenACC: very similar to OpenMP (only PGI, Cray, CAPS compilers for now)
- for PGI (see the sketch below):
  - `-acc` -> use OpenACC
  - `-Minfo=accel` -> compile for accelerator (not CPU, which is also possible)
  - `-ta=nvidia` (vs AMD or Intel accelerators)
- default architectures targeted are current GPU + major versions (1.0, 2.0, 3.0), but can be tuned via command line options
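A hedged sketch (the source file name is hypothetical) combining the PGI flags listed above into one compile line targeting NVIDIA GPUs:

```
pgcc -acc -ta=nvidia -Minfo=accel -o saxpy saxpy.c
```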
- CUDA compiler commands depend on the compiler being used, e.g. `pgfortran` (PGI) for CUDA Fortran
- `nvcc` is actually an LLVM frontend - developer.nvidia.com/llvm
- http://docs.nvidia.com
- http://developer.nvidia.com/nvidia-registered-developer-program to get GPUs
- adeconinck [at] nvidia.com for questions
- follow-up conf. call
- packaging of CUDA toolkit (`redhat`, `fedora`, `ubuntu`, ...)
- reason to not have a monolithic install is OS-specific stuff like paths for files, driver, etc.
- "let user provide CUDA toolkit installation instead of through EasyBuild" (not a good idea)
* FG sidenote: optimal practice @LU & @CY converges in the direction that:
- CUDA toolkit & software installation is in user space => this has to be done with EasyBuild, multiple versions OK
- driver installation must be done in root space => non-EasyBuild business, configuration management instead
- use `-silent` for only the toolkit (not the driver)
- OS-independent install package is being looked into
- can notes for NAMD be provided as well?
- CUDA vs Xeon Phi build process
- Xeon Phi is via magic options in Intel compilers (cfr. PGI)
- different modes: native mode (all on Xeon Phi), offload mode (host + Phi as accelerator), OpenACC (via pragmas in code), x86-only (run x86 binary on Phi)
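A hedged illustration of the first two modes with the Intel compiler (the source file name is hypothetical):

```
icc -mmic -O2 -o example.mic example.c   # native mode: the binary runs on the Xeon Phi itself
icc -O2 -o example example.c             # offload mode: '#pragma offload' regions run on the Phi
```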
- open questions
- standard variables for CUDA compilers (e.g. `CUDA_CC`)
- Intel compilers + CUDA?
- which `gencode` options should be set for a CUDA-enabled toolchain?
- performance issues with 'fat' binaries (multiple device architectures) due to instruction cache bottleneck?
- can compiled binary be queried for which options were used? (George T.)
- environment modules may be the Tcl-only version, which only provides `modulecmd.tcl`
  - EasyBuild needs to be able to handle that
- `modulecmd` may not be in path, but hardcoded in `module`
- make bootstrap script work offline too, i.e. add option to supply it the required source tarballs
- goolf v1.5.10
- GCC 4.7.2
- OpenMPI >=1.6.3 (1.7rc8 not production ready + requires GCC > v4.8!)
- OpenBLAS 0.2.6
- LAPACK 3.4.2
- FFTW 3.3.3 (single/double)
- ScaLAPACK 2.0.2
- problems with EasyBuild bootstrap script during training exercises
- use `modulecmd help` instead of `-H` (latter doesn't work with Tcl environment modules?)
- warn about installing with root
- `lib` vs `lib64`
- offline mode for bootstrap script required, e.g. login nodes can not go online (Mohamed)
- but workernodes can :)