Releases: ICLDisco/parsec

parsec-4.0.2411

19 Nov 13:51
cdb2e7f

Curated Change log

Added

  • PaRSEC API 4.0.
  • Add DTD CUDA support including NEW tiles in DTD.
  • Add ROCm/HIP device support.
  • Add IrisXE/Level0 device support (experimental).
  • Enable users to manage their own data copies without PaRSEC interfering. Each data copy is marked as owned by PaRSEC or not,
    and as managed by PaRSEC or not. A data copy owned by PaRSEC can be reclaimed by PaRSEC when its reference count reaches 0.
    A data copy managed by PaRSEC can be copied or moved onto a different device, while a data copy not managed by PaRSEC will
    never be moved by the runtime.
  • Add an info system, and introduce two info hooks. See parsec/class/info.h for details. The info system allows the user to register info objects with different levels of structures and dynamic objects in the PaRSEC runtime.
  • PTG supports user-defined routines to move data between GPU and CPU, and user-defined sizes for buffers allocated on the GPU.
  • PTG supports reshaping data propagated between local tasks and the specification of two types on accesses to data collections.
  • PINS logs SCHEDULE_BEGIN and SCHEDULE_END events to better track the task lifecycle.
  • Detect and report oversubscribed binding of core resources.
  • PaRSEC Thread binding can be disabled (bind_threads 0 MCA parameter).
  • Load balancing between GPUs can be tuned (device_load_balance_skew MCA parameter).
  • Load balancing exclusivity between CPU/GPUs can be disabled (device_load_balance_allow_cpu MCA parameter).
  • Data sent in messages can be of variable size.
  • New API parsec_context_query can be used to obtain information on the system, like the number of devices, ranks, etc.
  • New active-message communication API gives low-level access to the PaRSEC communication system to DSLs.
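
The ownership and management rules for data copies described in the list above can be illustrated with a small sketch. The flag names, the `data_copy_t` structure, and the helper functions are hypothetical and exist only for illustration; they are not the PaRSEC API. The sketch only encodes the two rules stated in the changelog: an owned copy may be reclaimed once its reference count reaches 0, and only a managed copy may be moved to another device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag names, for illustration only (not the PaRSEC API). */
enum {
    COPY_OWNED_BY_RUNTIME   = 1u << 0, /* the runtime may reclaim the copy */
    COPY_MANAGED_BY_RUNTIME = 1u << 1  /* the runtime may copy/move it across devices */
};

/* Hypothetical stand-in for a data copy descriptor. */
typedef struct {
    uint32_t flags;    /* combination of the flags above */
    int      refcount; /* number of outstanding references */
    int      device;   /* index of the device currently holding the copy */
} data_copy_t;

/* An owned copy can be reclaimed only when nobody references it anymore. */
static bool copy_can_be_reclaimed(const data_copy_t *c) {
    assert(c != NULL);
    return (c->flags & COPY_OWNED_BY_RUNTIME) != 0 && c->refcount == 0;
}

/* Only managed copies may be moved by the runtime. */
static bool copy_can_be_moved(const data_copy_t *c) {
    assert(c != NULL);
    return (c->flags & COPY_MANAGED_BY_RUNTIME) != 0;
}

/* Moves the copy if allowed; an unmanaged copy stays where the user put it. */
static bool copy_try_move(data_copy_t *c, int target_device) {
    if (!copy_can_be_moved(c)) {
        return false;
    }
    c->device = target_device;
    return true;
}
```

In this sketch a user-provided copy with no flags set is never reclaimed and never moved, which matches the "without PaRSEC interfering" behavior described above.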

Changed

  • Single letter command line options have been replaced with --mca parameters. --help is now --parsec-help.
  • Renamed symbols related to data distribution to properly prefix them with the parsec_ prefix. The old symbols have been deprecated.
  • DTD interface change: the global array parsec_dtd_arena_datatypes is replaced with functions to create, destroy, and get arena
    datatypes for DTD, and these objects now live inside the parsec context.
  • PARSEC_SUCCESS changed to 0 (from -1), all values for PARSEC_ERR_XYZ changed.
  • PaRSEC now requires CMake 3.21.
  • PaRSEC profiling tools now require Python 3.x.
  • PaRSEC profiling system no longer requires local dictionaries to be identical between ranks.
  • time_estimate functions can be used to control task load balancing (replaces weight PTG property).
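
Because PARSEC_SUCCESS changed from -1 to 0 and all PARSEC_ERR_XYZ values changed, code that compares return codes against literal numbers may silently misbehave after upgrading. The sketch below is a minimal illustration under stated assumptions: the `PARSEC_SUCCESS` definition is a stand-in for the value given in this changelog (real code would include `parsec.h` instead), and both helper functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the value stated in the changelog; real code gets it from parsec.h. */
#ifndef PARSEC_SUCCESS
#define PARSEC_SUCCESS 0
#endif

/* Portable check: compares against the symbolic constant. */
static bool rc_ok_symbolic(int rc) {
    return rc == PARSEC_SUCCESS;
}

/* Fragile check: hard-codes the API 3.0 success value (-1). */
static bool rc_ok_hardcoded_legacy(int rc) {
    return rc == -1;
}
```

With the API 4.0 value, `rc_ok_symbolic(PARSEC_SUCCESS)` holds while `rc_ok_hardcoded_legacy(PARSEC_SUCCESS)` does not, so comparisons should go through the symbolic names.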

Deprecated

  • Data distribution symbols without the parsec_ prefix. Further documentation (including a
    sed script) can be found in contrib/renaming.

Removed

  • PaRSEC API 3.0
  • RECURSIVE Device support (this is temporary and will be restored in a future version).
  • Removed obsolete dbp2paje tool; h5totrace is the replacement tool to use. This removes the optional dependency on GTG.
  • Removed all command line options not prefixed by --mca, except for --parsec-help and --parsec-version.
  • Using more than PARSEC_GPU_MAX_WORKSPACE workspaces per device will now cause an error (instead of computing incorrect values).
  • PTG property weight (replaced by time_estimate).

Fixed

  • DTD Termination detection would occasionally assert.
  • Multiple bugs with GPU data ownership causing crashes and incorrect results when executing with more than 1 GPU.
  • Device-to-device memory copies would not work in some scenarios.
  • Suboptimal ordering of members in broadcast tree could cause performance reduction.
  • Cray MPI and MPICH would crash in MPI_Cancel and when using NULL datatypes.
  • Do not report incorrect flops/s capabilities (device_show_capabilities MCA parameter).
  • On some systems PaRSEC would allocate more GPU memory than is available on the device.
  • Performance with large number of GPU tasks with the same priority would be poor due to overhead of sorting by priority.

Known Bugs

  • PaRSEC Thread binding ignores externally provided binding (e.g., a cpuset enforced by srun); see issue ICLDisco/dplasma#9.
  • Enabling the RECURSIVE device will cause crashes (it is disabled by default in this release); see issues #548, #541.
  • Running out of GPU memory when using the NEW keyword in PTG may cause deadlocks; see issue #527.

Security

Merged Pull Requests

List of merged pull requests
  • [BBT#582] bugfix/atomic lifo: The offsetof was incorrect leading to lifo padding being wrong in external lifo by @abouteiller in #316
  • First sketch of a github action for building by @bosilca in #309
  • Miscellaneous profiling fixes by @omor1 in #320
  • Per-language compiler flags by @therault in #326
  • [BBT#541] A new way to install the internal headers by @bosilca in #322
  • Doc/GitHub by @abouteiller in #330
  • Provide a temporary fix for the flag detection. by @bosilca in #336
  • We need BISON 3, and try to automatically pick the brew variant on Mac OSX by @abouteiller in #331
  • Clean strings usages in CMake. by @bosilca in #340
  • Allow the runtime to compile even when PTG support is not possible. by @bosilca in #332
  • Work around GCC bug for atomic_thread_fence with memory order acquire by @devreal in #343
  • Fix parsec_future: volatile and memory barriers by @devreal in #342
  • Reshape test: variable used for polling should be volatile by @devreal in #344
  • Dust off the cmake_modules by @abouteiller in #346
  • New CMake versions use MPI_ROOT to find MPI by @abouteiller in #345
  • Fallback using a compatible HWLOC. by @bosilca in #341
  • hotfix: compile failure when Ayudame not found by @abouteiller in #348
  • Fix/quick fixes by @bosilca in #350
  • Update issue template to make it easier to read and easier to fill-up by @abouteiller in #349
  • Update the installation instructions by @abouteiller in #354
  • Cleanup/ptgpp assignments by @abouteiller in #352
  • Apply -g3 to DEBUG only, set default config to Release by @abouteiller in #347
  • Profiling msync and header commit by @therault in #337
  • Removing hard flex/bison dependency: only devs need to run the parser by @abouteiller in #335
  • Hicma/recursive by @bosilca in #328
  • Fix/deprecated support by @bosilca in #362
  • Add the filename to the generated profiling event name. by @bosilca in #359
  • Fix atomics on macosX not working properly (missing header) by @abouteiller in #356
  • Remove never compiled in '64bit' lifo implementation by @abouteiller in #360
  • Fix/many small updates by @bosilca in #363
  • Make the ParsecCompilerFlags.cmake self contained by @abouteiller in #364
  • Profiling fix: parsec_init(NULL, NULL) by @therault in #339
  • GitHub runner with spack by @bosilca in #333
  • Update PAPI SDE to fit the current API by @therault in #365
  • ucontext is not supported on OSX. by @bosilca in #366
  • recursive cb type was not correct by @abouteiller in #368
  • Since new policy, setting the non-cache variable creates an empty cache by @abouteiller in #367
  • Do now allow spack to be updated automatically. by @bosilca in #375
  • flex: on some machines, flex cannot work if parsec/utils is not created by @abouteiller in #374
  • Attempt to backport the revamp of the communication engine by @devreal in #380
  • Respect DISTDIR is provided. by @bosilca in #383
  • [RFC] profiling tools: more efficient cross-stream event matching by @omor1 in #372
  • Hash table: count used buckets only when needed by @devreal in #379
  • Print the debug rank from device_show_statistics by @abouteiller in #386
  • Handle error in CUDA/HIP module init and configurable max_streams by @therault in #351
  • Update to a newer spack compiler by @bosilca in #392
  • Make the PUSHOUT and other DTD GPU concepts generic by @abouteiller in #387
  • Workaround current CUDA/HIP "solution suspicious" bug... by @therault in #381
  • dtd_bench_simple_gemm.c relies on non-standard cblas.h file by @therault in https://...

v3.0.2209

11 Sep 22:42
854aade

PaRSEC 22.09 (September 2022) API 3.0

  • Fix PaRSEC not compiling with gcc 10+

v3.0.2012

03 Mar 22:47
d2ae417

PaRSEC 20.12 (December 2020)

  • PaRSEC API 3.0

  • PaRSEC now requires CMake 3.16.

  • New configure system to ease the installation of PaRSEC. See
    INSTALL for details. This system automates installation on most DOE
    leadership systems.

  • Split DPLASMA and PaRSEC into separate repositories. PaRSEC moves from
    cmake-2.0 to cmake-3.12, using targets. Targets are exported for
    third-party integration.

  • Add visualization tools to extract user-defined properties from the
    application (see: PR 229 visualization-tools)

  • Automate expression of required data transfers from host-to-device and
    device-to-host to satisfy dependencies (and anti-dependencies). PaRSEC tracks
    multiple versions of the same data as data copies with a coherency algorithm
    that initiates data transfers as needed. The heuristic for the eviction policy
    on out-of-memory events on the GPU has been optimized to allow for efficient
    operation on problems larger than GPU memory.

  • Add support for MPI out-of-order matching capabilities. Add the capability
    for compute threads to send control messages directly to remote nodes to
    indicate completion of tasks (without delegation to the communication thread).

  • Remove communication mode EAGER from the runtime. It had a
    hard-to-correct bug that would occasionally deadlock, and the performance
    benefit was small.

  • Add a Map operator on the Block Cyclic matrix data collection that
    performs in-place data transformation on the collection with a user provided
    operator.

  • Add support in the runtime for user-defined properties evaluated at
    runtime and easy to export through a shared memory region (see: PR
    229 visualization-tools)

  • Add a PAPI-SDE interface to the parsec library, to expose internal
    counters via the PAPI-Software Defined Events interface.

  • Add backend support for OTF2 in the profiling mechanism. OTF2 is
    used automatically if an OTF2 installation is found.

  • Add an MCA parameter to control the number of ejected blocks from GPU
    memory (device_cuda_max_number_of_ejected_data). Add an MCA parameter
    to control whether or not the GPU engine will take some time to sort
    the first N tasks of the pending queue (device_cuda_sort_pending_list).

  • Reshape the user's view of PaRSEC: users only have to include a single
    header (parsec.h) for most usages, and link with a single library
    (-lparsec).

  • Update the PaRSEC DSL handling of initial tasks. We now rely on 2
    pieces of information: the number of DSL tasks, and the number of
    tasks imposed by the system (all types of data transfer).

  • Add a purely local scheduler (ll), that uses a single LIFO per
    thread. Each schedule operation does 1 atomic (push in local queue),
    each select operation does up to t atomics (pop in local queue, then
    try any other thread's queue until they are all tested empty).

  • Add a --ignore-properties=... option to parsec_ptgpp.

  • Change API of hash tables: allow keys of arbitrary size. The API
    features how to build a key from a task; how to hash a key into
    1 <= N <= 64 bits; and how to compare two keys (plus a printing
    function to debug).

  • Change behavior of DEBUG_HISTORY: log all information inside
    a buffer of fixed size (MCA parameter) per thread, do not allocate
    memory during logging, and use timestamps to re-order output
    when the user calls dump().

  • DTD interface is updated (new flag to send a pointer as a parameter,
    unpacking of parameters is simpler, etc.).

  • DTD provides an MCA parameter (dtd_debug_verbose) to print information
    about traversal of the DAG in a separate output stream from the default.
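
The purely local scheduler (ll) listed above, with one LIFO per thread, can be sketched as follows. This is a minimal illustration under stated assumptions, not PaRSEC's implementation: all names (`sched_init`, `schedule`, `select_task`, `NB_THREADS`, `MAX_TASKS`) are hypothetical, tasks are represented by indices into a fixed pool, and each LIFO head packs a tag with the index so that a compare-and-swap cannot be fooled by a task that was popped and pushed again (the ABA problem). A schedule operation performs one successful compare-and-swap on the local queue; a select operation tries the local queue first and then every other thread's queue, so it gives up after at most `NB_THREADS` pop attempts.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NB_THREADS 4          /* hypothetical number of worker threads (t) */
#define MAX_TASKS  64         /* hypothetical size of the task pool */
#define NIL        UINT32_MAX /* index meaning "no task" */

/* A task is identified by its index in this pool; next links the LIFO. */
typedef struct { _Atomic uint32_t next; } task_t;
static task_t tasks[MAX_TASKS];

/* head = (tag << 32) | index; the tag changes on every update (ABA guard). */
typedef struct { _Atomic uint64_t head; } lifo_t;
static lifo_t queues[NB_THREADS];

static uint64_t pack(uint32_t tag, uint32_t idx) { return ((uint64_t)tag << 32) | idx; }
static uint32_t idx_of(uint64_t h) { return (uint32_t)(h & 0xFFFFFFFFu); }
static uint32_t tag_of(uint64_t h) { return (uint32_t)(h >> 32); }

/* Must be called once before any schedule/select operation. */
static void sched_init(void) {
    for (int i = 0; i < NB_THREADS; i++)
        atomic_init(&queues[i].head, pack(0, NIL));
}

static void lifo_push(lifo_t *q, uint32_t t) {
    uint64_t old = atomic_load(&q->head);
    do {
        /* Link the new task in front of the current head. */
        atomic_store_explicit(&tasks[t].next, idx_of(old), memory_order_relaxed);
    } while (!atomic_compare_exchange_weak(&q->head, &old, pack(tag_of(old) + 1, t)));
}

static uint32_t lifo_pop(lifo_t *q) {
    uint64_t old = atomic_load(&q->head);
    while (idx_of(old) != NIL) {
        uint32_t next = atomic_load_explicit(&tasks[idx_of(old)].next, memory_order_relaxed);
        if (atomic_compare_exchange_weak(&q->head, &old, pack(tag_of(old) + 1, next)))
            return idx_of(old);
        /* On failure, old was refreshed with the current head; retry. */
    }
    return NIL;
}

/* Schedule: push into the calling thread's own queue. */
static void schedule(int me, uint32_t t) {
    assert(me >= 0 && me < NB_THREADS && t < MAX_TASKS);
    lifo_push(&queues[me], t);
}

/* Select: try the local queue, then every other thread's queue in turn. */
static uint32_t select_task(int me) {
    for (int i = 0; i < NB_THREADS; i++) {
        uint32_t t = lifo_pop(&queues[(me + i) % NB_THREADS]);
        if (t != NIL)
            return t;
    }
    return NIL; /* all queues were observed empty */
}
```

A worker would call `schedule(me, task)` when a task becomes ready and `select_task(me)` when it needs work. Keeping pushes strictly local favors cache locality, at the cost of a scan over all queues when the local one is empty.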