Releases: ICLDisco/parsec
parsec-4.0.2411
Curated Change log
Added
- PaRSEC API 4.0.
- Add DTD CUDA support including NEW tiles in DTD.
- Add ROCm/HIP device support.
- Add IrisXE/Level0 device support (experimental).
- Enable users to manage their own data copies without PaRSEC interfering. Data copies are marked as being owned by PaRSEC or not and managed by PaRSEC or not. A data copy owned by PaRSEC can be reclaimed by PaRSEC when its reference count reaches 0, a data copy managed by PaRSEC can be copied / moved onto a different device, while a data copy not managed by PaRSEC will never be moved by the runtime.
- Add an info system, and introduce two info hooks. See parsec/class/info.h for details. The info system allows the user to register info objects with different levels of structures and dynamic objects in the PaRSEC runtime.
- PTG supports user-defined routines to move data between GPU and CPU, and user-defined sizes for buffers allocated on the GPU.
- PTG supports reshaping data propagated between local tasks and the specification of two types on accesses to data collections.
- PINS log SCHEDULE_BEGIN and SCHEDULE_END events to better track tasks lifecycle.
- Detect and report oversubscribed binding of core resources.
- PaRSEC Thread binding can be disabled (bind_threads 0 MCA parameter).
- Load balancing between GPUs can be tuned (device_load_balance_skew MCA parameter).
- Load balancing exclusivity between CPU/GPUs can be disabled (device_load_balance_allow_cpu MCA parameter).
- Data sent in messages can be of variable size.
- New API parsec_context_query can be used to obtain information on the system, like the number of devices, ranks, etc.
- New active-message communication API gives low-level access to the PaRSEC communication system to DSLs.
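The owned/managed distinction for data copies described in the list above can be pictured with a small toy model. This is only a sketch of the stated semantics, written in Python for brevity: the class `DataCopy`, its fields, and the helpers `release` and `try_move` are hypothetical names and are not part of the PaRSEC API.

```python
from dataclasses import dataclass


@dataclass
class DataCopy:
    """Toy stand-in for a runtime data copy (hypothetical, not a PaRSEC type)."""
    owned_by_runtime: bool     # the runtime may reclaim the copy at refcount 0
    managed_by_runtime: bool   # the runtime may copy/move it to another device
    device: int = 0
    refcount: int = 1
    reclaimed: bool = False


def release(copy: DataCopy) -> None:
    """Drop one reference; reclaim only when the runtime owns the copy."""
    copy.refcount -= 1
    if copy.refcount == 0 and copy.owned_by_runtime:
        copy.reclaimed = True


def try_move(copy: DataCopy, target_device: int) -> bool:
    """Move the copy to another device only when the runtime manages it."""
    if not copy.managed_by_runtime:
        return False
    copy.device = target_device
    return True


# A user-provided copy: neither owned nor managed by the runtime.
user_copy = DataCopy(owned_by_runtime=False, managed_by_runtime=False)
release(user_copy)        # refcount reaches 0, but the copy is not reclaimed
try_move(user_copy, 1)    # refused: the copy stays on device 0
```

In this model the two flags are independent, which mirrors the wording "owned by PaRSEC or not and managed by PaRSEC or not".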
Changed
- Single letter command line options have been replaced with --mca parameters. --help is now --parsec-help.
- Renamed symbols related to data distribution to properly prefix them with the parsec_ prefix. The old symbols have been deprecated.
- DTD interface change: the global array parsec_dtd_arena_datatypes is replaced with functions to create, destroy, and get arena datatypes for DTD, and these objects now live inside the parsec context.
- PARSEC_SUCCESS changed to 0 (from -1), all values for PARSEC_ERR_XYZ changed.
- PaRSEC now requires CMake 3.21.
- PaRSEC profiling tools now require Python 3.x.
- PaRSEC profiling system no longer requires local dictionaries to be identical between ranks.
- time_estimate functions can be used to control task load balancing (replaces weight PTG property).
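Because PARSEC_SUCCESS changed from -1 to 0 and all error values moved, code that compares return values against numeric literals is likely to misbehave after upgrading. The toy sketch below only illustrates that point. The two success values come from the list above, while the function names are hypothetical and no real PaRSEC call is involved.

```python
# Success values as stated in the change log.
OLD_SUCCESS = -1   # API 3.0
NEW_SUCCESS = 0    # API 4.0


def fragile_ok(rc: int) -> bool:
    """Hard-codes the API 3.0 numeric value, so it fails under API 4.0."""
    return rc == -1


def robust_ok(rc: int, success_code: int) -> bool:
    """Compares against the symbolic constant supplied by the headers in use."""
    return rc == success_code


# A successful call under API 4.0 returns NEW_SUCCESS.
rc = NEW_SUCCESS
fragile_result = fragile_ok(rc)              # False: success misread as failure
robust_result = robust_ok(rc, NEW_SUCCESS)   # True
```

In C code the equivalent habit is to always compare against PARSEC_SUCCESS and the PARSEC_ERR_XYZ names rather than their numeric values.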
Deprecated
- Data distribution symbols without the parsec_ prefix. Further documentation (including a sed script) can be found in contrib/renaming.
Removed
- PaRSEC API 3.0
- RECURSIVE Device support (this is temporary and will be restored in a future version).
- Removed obsolete dbp2paje tool; h5totrace is the replacement tool to use. This removes the optional dependency on GTG.
- Removed all command line options not prefixed by --mca, except for --parsec-help and --parsec-version.
- Using more than PARSEC_GPU_MAX_WORKSPACE workspaces per device will now cause an error (instead of computing incorrect values).
- PTG property weight (replaced by time_estimate).
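Since runtime options are now passed as --mca parameters, launch scripts that used the removed short options need to build the new form. The helper below is a minimal sketch that assumes each parameter is given as a `--mca <name> <value>` pair. The helper name `mca_args` is hypothetical, the values are illustrative, and how an application forwards these arguments to the runtime is not covered by these notes.

```python
def mca_args(params: dict) -> list:
    """Flatten {name: value} into ['--mca', name, value, ...] (assumed syntax)."""
    args = []
    for name, value in params.items():
        args += ["--mca", name, str(value)]
    return args


# Parameter names are taken from the change log; the values are illustrative.
opts = mca_args({"bind_threads": 0, "device_load_balance_allow_cpu": 1})
print(" ".join(opts))
```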
Fixed
- DTD Termination detection would occasionally assert.
- Multiple bugs with GPU data ownership causing crashes and incorrect results when executing with more than 1 GPU.
- Device-to-device memory copies would not work in some scenarios.
- Suboptimal ordering of members in broadcast tree could cause performance reduction.
- Cray MPI and MPICH would crash in MPI_Cancel and when using NULL datatypes.
- Do not report incorrect flops/s capabilities (device_show_capabilities MCA parameter).
- On some systems PaRSEC would allocate more GPU memory than is available on the device.
- Performance with large number of GPU tasks with the same priority would be poor due to overhead of sorting by priority.
Known Bugs
- PaRSEC Thread binding ignores externally provided binding (e.g., a cpuset enforced by srun); see issue ICLDisco/dplasma#9.
- Enabling the RECURSIVE device will cause crashes (it is disabled by default in this release); see issues #548, #541.
- Running out of GPU memory when using the NEW keyword in PTG may cause deadlocks; see issue #527.
Security
Merged Pull Requests
- [BBT#582] bugfix/atomic lifo: The offsetof was incorrect leading to lifo padding being wrong in external lifo by @abouteiller in #316
- First sketch of a github action for building by @bosilca in #309
- Miscellaneous profiling fixes by @omor1 in #320
- Per-language compiler flags by @therault in #326
- [BBT#541] A new way to install the internal headers by @bosilca in #322
- Doc/GitHub by @abouteiller in #330
- Provide a temporary fix for the flag detection. by @bosilca in #336
- We need BISON 3, and try to automatically pick the brew variant on Mac OSX by @abouteiller in #331
- Clean strings usages in CMake. by @bosilca in #340
- Allow the runtime to compile even when PTG support is not possible. by @bosilca in #332
- Work around GCC bug for atomic_thread_fence with memory order acquire by @devreal in #343
- Fix parsec_future: volatile and memory barriers by @devreal in #342
- Reshape test: variable used for polling should be volatile by @devreal in #344
- Dust off the cmake_modules by @abouteiller in #346
- New CMake versions use MPI_ROOT to find MPI by @abouteiller in #345
- Fallback using a compatible HWLOC. by @bosilca in #341
- hotfix: compile failure when Ayudame not found by @abouteiller in #348
- Fix/quick fixes by @bosilca in #350
- Update issue template to make it easier to read and easier to fill-up by @abouteiller in #349
- Update the installation instructions by @abouteiller in #354
- Cleanup/ptgpp assignments by @abouteiller in #352
- Apply -g3 to DEBUG only, set default config to Release by @abouteiller in #347
- Profiling msync and header commit by @therault in #337
- Removing hard flex/bison dependency: only devs need to run the parser by @abouteiller in #335
- Hicma/recursive by @bosilca in #328
- Fix/deprecated support by @bosilca in #362
- Add the filename to the generated profiling event name. by @bosilca in #359
- Fix atomics on macosX not working properly (missing header) by @abouteiller in #356
- Remove never compiled in '64bit' lifo implementation by @abouteiller in #360
- Fix/many small updates by @bosilca in #363
- Make the ParsecCompilerFlags.cmake self contained by @abouteiller in #364
- Profiling fix: parsec_init(NULL, NULL) by @therault in #339
- GitHub runner with spack by @bosilca in #333
- Update PAPI SDE to fit the current API by @therault in #365
- ucontext is not supported on OSX. by @bosilca in #366
- recursive cb type was not correct by @abouteiller in #368
- Since new policy, setting the non-cache variable creates an empty cache by @abouteiller in #367
- Do now allow spack to be updated automatically. by @bosilca in #375
- flex: on some machines, flex cannot work if parsec/utils is not created by @abouteiller in #374
- Attempt to backport the revamp of the communication engine by @devreal in #380
- Respect DISTDIR is provided. by @bosilca in #383
- [RFC] profiling tools: more efficient cross-stream event matching by @omor1 in #372
- Hash table: count used buckets only when needed by @devreal in #379
- Print the debug rank from device_show_statistics by @abouteiller in #386
- Handle error in CUDA/HIP module init and configurable max_streams by @therault in #351
- Update to a newer spack compiler by @bosilca in #392
- Make the PUSHOUT and other DTD GPU concepts generic by @abouteiller in #387
- Workaround current CUDA/HIP "solution suspicious" bug... by @therault in #381
- dtd_bench_simple_gemm.c relies on non-standard cblas.h file by @therault in https://...
v3.0.2209
PaRSEC 22.09 (September 2022) API 3.0
- Fix PaRSEC not compiling with gcc 10+
v3.0.2012
PaRSEC 20.12 (December 2020)
- PaRSEC API 3.0
- PaRSEC now requires CMake 3.16.
- New configure system to ease the installation of PaRSEC. See INSTALL for details. This system automates installation on most DOE leadership systems.
- Split DPLASMA and PaRSEC into separate repositories. PaRSEC moves from cmake-2.0 to cmake-3.12, using targets. Targets are exported for third-party integration.
- Add visualization tools to extract user-defined properties from the application (see: PR 229 visualization-tools).
- Automate expression of required data transfers from host-to-device and device-to-host to satisfy dependencies (and anti-dependencies). PaRSEC tracks multiple versions of the same data as data copies with a coherency algorithm that initiates data transfers as needed. The heuristic for the eviction policy in out-of-memory events on GPU has been optimized to allow for efficient operation in larger-than-GPU-memory problems.
- Add support for MPI out-of-order matching capabilities. Added capability for compute threads to send direct control messages to indicate completion of tasks to remote nodes (without delegation to the communication thread).
- Remove communication mode EAGER from the runtime. It had a rare but hard-to-correct bug that could deadlock, and the performance benefit was small.
- Add a Map operator on the Block Cyclic matrix data collection that performs in-place data transformation on the collection with a user-provided operator.
- Add support in the runtime for user-defined properties evaluated at runtime and easy to export through a shared memory region (see: PR 229 visualization-tools).
- Add a PAPI-SDE interface to the parsec library, to expose internal counters via the PAPI-Software Defined Events interface.
- Add backend support for OTF2 in the profiling mechanism. OTF2 is used automatically if an OTF2 installation is found.
- Add an MCA parameter to control the number of ejected blocks from GPU memory (device_cuda_max_number_of_ejected_data). Add an MCA parameter to control whether or not the GPU engine will take some time to sort the first N tasks of the pending queue (device_cuda_sort_pending_list).
- Reshape the user's vision of PaRSEC: they only have to include a single header (parsec.h) for most usages, and link with a single library (-lparsec).
- Update the PaRSEC DSL handling of initial tasks. We now rely on 2 pieces of information: the number of DSL tasks, and the number of tasks imposed by the system (all types of data transfer).
- Add a purely local scheduler (ll), that uses a single LIFO per thread. Each schedule operation does 1 atomic (push in local queue), each select operation does up to t atomics (pop in local queue, then try any other thread's queue until they are all tested empty).
- Add a --ignore-properties=... option to parsec_ptgpp.
- Change API of hash tables: allow keys of arbitrary size. The API features how to build a key from a task; how to hash a key into 1 <= N <= 64 bits; and how to compare two keys (plus a printing function to debug).
- Change behavior of DEBUG_HISTORY: log all information inside a buffer of fixed size (MCA parameter) per thread, do not allocate memory during logging, and use timestamp to re-order output when the user calls dump().
- DTD interface is updated (new flag to send pointer as parameter, unpacking of parameters is simpler, etc.).
- DTD provides an MCA param (dtd_debug_verbose) to print information about traversal of the DAG in a separate output stream from the default.
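
The behavior of the purely local scheduler (ll) listed above can be sketched with a toy model: one LIFO per thread, one queue operation per schedule, and up to t queue operations per select, where t is the number of threads. This is not the PaRSEC implementation. The class name, the `atomics` counter, and the round-robin order in which other threads' queues are probed are all assumptions made for illustration, and plain Python lists stand in for atomic LIFOs.

```python
class LocalLifoScheduler:
    """Toy model of a per-thread LIFO scheduler (hypothetical, single-threaded)."""

    def __init__(self, nb_threads: int):
        self.queues = [[] for _ in range(nb_threads)]
        self.atomics = 0  # counts queue operations that would be atomic

    def schedule(self, thread_id: int, task) -> None:
        """One atomic: push the task on the calling thread's own queue."""
        self.atomics += 1
        self.queues[thread_id].append(task)

    def select(self, thread_id: int):
        """Up to t atomics: try the local queue first, then every other queue."""
        t = len(self.queues)
        for offset in range(t):
            victim = (thread_id + offset) % t  # probing order is an assumption
            self.atomics += 1
            if self.queues[victim]:
                return self.queues[victim].pop()  # LIFO: most recent task first
        return None  # all queues were tested and found empty


sched = LocalLifoScheduler(nb_threads=3)
sched.schedule(0, "a")
sched.schedule(0, "b")
first = sched.select(0)    # "b", taken from the local queue
stolen = sched.select(2)   # "a", found after thread 2's own empty queue
```

The model makes the trade-off visible: scheduling is always cheap, while a select on an idle system pays one probe per thread before giving up.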