Releases: It4innovations/hyperqueue

v0.17.0

01 Nov 10:12

HyperQueue 0.17.0

Breaking change

Memory resource in megabytes

  • The automatically detected resource "mem", which represents the size of a worker's RAM, now uses megabytes as its unit,
    i.e. --resource mem=100 now asks for 100 MiB (previously 100 bytes).
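
    For example, under the new unit, asking for 1 GiB of RAM looks like this (a sketch; it reuses the
    --resource flag shown above):

    $ hq submit --resource mem=1024 /my-program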

New features

Non-integer resource requests

  • You may now ask for a non-integer amount of a resource, e.g. 0.5 of a GPU.
    This enables resource sharing at the logical level of the HyperQueue scheduler and allows the remaining
    part of the resource to be utilized by other tasks.
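
    For example, to ask for half of a GPU (a sketch; it assumes a worker that exposes the automatically
    detected gpus/nvidia resource):

    $ hq submit --resource gpus/nvidia=0.5 /my-program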

Job submission

  • You can now specify cleanup modes when passing stdout/stderr paths to tasks. The cleanup mode decides what should
    happen to the file once the task has finished executing. Currently, a single cleanup mode is implemented, which removes
    the file if the task has finished successfully:

    $ hq submit --stdout=out.txt:rm-if-finished /my-program

Fixes

  • Fixed a crash when a task fails during its initialization

Artifact summary:

  • hq-v0.17.0-*: Main HyperQueue build containing the hq binary. Download this archive to
    use HyperQueue from the command line.
  • hyperqueue-0.17.0-*: Wheel containing the hyperqueue package with HyperQueue Python
    bindings.

v0.17.0-rc1

25 Oct 11:52
Pre-release

HyperQueue 0.17.0-rc1

The release notes are identical to v0.17.0 above.

Artifact summary:

  • hq-v0.17.0-rc1-*: Main HyperQueue build containing the hq binary. Download this archive to
    use HyperQueue from the command line.
  • hyperqueue-0.17.0-rc1-*: Wheel containing the hyperqueue package with HyperQueue Python
    bindings.

v0.16.0

12 Jul 19:30

HyperQueue 0.16.0

New features

Pregenerating access files

  • Via the command hq server generate-access you can pre-create an access file that can later be used for starting the
    server and connecting workers and clients. This is useful in cloud environments.
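
    A minimal sketch of the intended workflow (the file name is illustrative, and the --access-file flag of
    hq server start is an assumption, not confirmed by these notes):

    $ hq server generate-access access.json        # pre-create the access file
    $ hq server start --access-file=access.json    # start the server using it (flag assumed)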

Job submission

  • A new command hq job forget <job-selector> has been introduced. It can be used to completely forget a job, and thus
    reduce the memory usage of the HQ server. It is especially useful if you submit a large number of jobs and keep the
    server running for a long time.
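
    For example (the range selector is illustrative; these notes only introduce the command itself):

    $ hq job forget 1-100    # forget jobs 1 through 100 (selector syntax assumed)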

Automatic allocation

  • Autoalloc can now execute a custom shell command/script on each worker node before the worker starts and after the
    worker stops. You can use this feature e.g. to initialize some data or load software modules for each worker node.

    $ hq alloc add pbs --time-limit 30m \
      --worker-start-cmd "/project/xxx/init-node.sh" \
      --worker-stop-cmd "/project/xxx/cleanup-node.sh"
  • You can now set a time limit for workers spawned in allocations with the --worker-time-limit flag. You can use this
    flag to make workers stop sooner, e.g. to give more headroom for a --worker-stop-cmd command to execute
    before the allocation is terminated. If you do not use this parameter, the worker time limit will be set to the
    time limit of the allocation.

    Example:

    $ hq alloc add pbs --time-limit 1h --worker-time-limit 58m --worker-stop-cmd "/project/xxxx/slow-command.sh"

    In this case, the allocation will run for one hour, but the HQ worker will be stopped after 58 minutes (unless it is
    stopped sooner because of idle timeout). The worker stop command will thus have at least two minutes to execute.

Changes

Access file

The format of the access file has changed. This is mostly an internal change, but you may encounter a parsing error when
connecting an old client/worker to a new server. (Connecting a new client/worker to an old server will give you a proper
error message.)

Artifact summary:

  • hq-v0.16.0-*: Main HyperQueue build containing the hq binary. Download this archive to
    use HyperQueue from the command line.
  • hyperqueue-0.16.0-*: Wheel containing the hyperqueue package with HyperQueue Python
    bindings.

v0.16.0-rc1

08 Jul 18:12
Pre-release

HyperQueue 0.16.0-rc1

The release notes are identical to v0.16.0 above.

Artifact summary:

  • hq-v0.16.0-rc1-*: Main HyperQueue build containing the hq binary. Download this archive to
    use HyperQueue from the command line.
  • hyperqueue-0.16.0-rc1-*: Wheel containing the hyperqueue package with HyperQueue Python
    bindings.

v0.15.0

17 Apr 19:54

HyperQueue 0.15.0

Breaking changes

  • NVIDIA GPUs are now automatically detected under the resource name gpus/nvidia, instead of
    just gpus!
    If you have been using the gpus resource name, you should update your scripts.
    See more details below.

New features

Resource management

  • You can now specify multiple resource configurations for one task, e.g. "1 cpu and 1 gpu" OR "4 cpus". The scheduler
    considers both configurations in task planning. For example, assume that we have many tasks with the mentioned
    configuration and a worker with 16 cpus and 4 gpus. The tasks will fully utilize the node: 4 tasks will run in the
    configuration with a gpu and 3 tasks will run in the cpu-only mode.

  • A Job Definition File is a TOML file that can define a job.
    It allows you to submit complex jobs without using the Python API (dependencies, resource variants, ...).

    $ hq job submit-file myfile.toml
  • You can now specify (indexed) resource values provided by workers as strings (previously only
    integers were allowed). Notably, automatic detection of Nvidia GPUs specified with string UUIDs
    now works.

    $ hq worker start --resource="res1=[foo, bar]"
  • HyperQueue now provides built-in support for AMD GPUs. For this reason, the default name of GPU
    resources that are automatically detected on a worker has been changed from gpus to gpus/nvidia
    for NVIDIA GPUs. AMD GPUs are now autodetected as gpus/amd. In the future, we intend to create a way
    to ask for any GPU resource (e.g. --resource=gpus=2), regardless of its type.
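
    For example, a task that previously requested --resource gpus=1 should now ask for the vendor-specific
    resource name:

    $ hq submit --resource gpus/nvidia=1 /my-program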

  • AMD GPUs are now automatically detected in workers from the environment variable ROCR_VISIBLE_DEVICES.

  • The allowed characters for resource names have changed. A name now has to begin with an ASCII letter,
    and it can only contain ASCII letters, ASCII digits and the slash (/) symbol. This restriction is
    introduced for better alignment with shells, which typically do not support complicated variable names.
    HQ passes resource names to executed tasks through environment variables, so it has to take this
    into account. Note that the / symbol in a resource name will be normalized to _ when being passed
    to a task.
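
    For example, a task requesting gpus/nvidia will see gpus_nvidia (not gpus/nvidia) in the names of the
    resource-related environment variables that HQ sets (the grep pattern below is illustrative):

    $ hq submit --resource gpus/nvidia=1 -- bash -c 'env | grep gpus_nvidia'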

  • hq task info now shows more information

Changes

Job submission

  • The default path for stdout and stderr files has been changed from %{SUBMIT_DIR}/job-%{JOB_ID}/%{TASK_ID}.[stdout/stderr]
    to %{CWD}/job-%{JOB_ID}/%{TASK_ID}.[stdout/stderr]. Note that the default value for the working
    directory (%{CWD}) is set to the submission directory, so if you have used the defaults before,
    nothing will change for you. Stdout and stderr paths are now also resolved relative to the working
    directory of the given task, not to the submit directory.
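
    For illustration, with the new defaults a job submitted from /home/user writes its output like this
    (the job/task IDs and paths are illustrative, derived from the template above):

    $ cd /home/user
    $ hq submit echo hello
    # task 0 of job 1 writes to /home/user/job-1/0.stdout and /home/user/job-1/0.stderr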

Artifact summary:

  • hq-v0.15.0-*: Main HyperQueue build containing the hq binary. Download this archive to
    use HyperQueue from the command line.
  • hyperqueue-0.15.0-*: Wheel containing the hyperqueue package with HyperQueue Python
    bindings.

v0.15.0-rc1

12 Apr 20:14
Pre-release

HyperQueue 0.15.0-rc1

The release notes are identical to v0.15.0 above.

Artifact summary:

  • hq-v0.15.0-rc1-*: Main HyperQueue build containing the hq binary. Download this archive to
    use HyperQueue from the command line.
  • hyperqueue-0.15.0-rc1-*: Wheel containing the hyperqueue package with HyperQueue Python
    bindings.

v0.14.0

02 Feb 10:28

HyperQueue 0.14.0

New features

CLI

  • #545 Add a new command hq job summary,
    which displays the number of jobs in each job state.

Platforms

  • HQ can now be compiled for Raspberry Pi

Fixes

Worker

  • #539 Fix connection of worker to server
    in the presence of both IPv4 and IPv6 addresses.

Job submission

  • #540 Parse all arguments from the shebang
    in a directives file (e.g. #!/bin/bash -l).
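
    For example (a sketch; the #HQ directive line and the script name are illustrative):

    $ cat script.sh
    #!/bin/bash -l
    #HQ --cpus=2
    echo "running in a login shell"
    $ hq submit script.sh    # both the -l shebang argument and the #HQ directives are honored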

Streaming

  • Fixed a bug in closing streaming when tasks are very short and synchronized.

Artifact summary:

  • hq-v0.14.0-*: Main HyperQueue build containing the hq binary. Download this archive to
    use HyperQueue from the command line.
  • hyperqueue-0.14.0-*: Wheel containing the hyperqueue package with HyperQueue Python
    bindings.

v0.14.0-rc1

29 Jan 18:43
Pre-release

HyperQueue 0.14.0-rc1

The release notes are identical to v0.14.0 above.

Artifact summary:

  • hq-v0.14.0-rc1-*: Main HyperQueue build containing the hq binary. Download this archive to
    use HyperQueue from the command line.
  • hyperqueue-0.14.0-rc1-*: Wheel containing the hyperqueue package with HyperQueue Python
    bindings.

v0.13.0

04 Nov 09:59

HyperQueue 0.13.0

New features

Resource management

  • Almost complete rewrite of resource management.
    CPUs and other resources were unified: the most visible change is that "cpus" can now be defined like any
    other resource, and resources can now be defined in groups (NUMA-like resources).
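
    For example, a worker resource split into two groups might be defined like this (a sketch; the
    nested-list syntax is an assumption extrapolated from the list syntax described under Changes below):

    $ hq worker start --resource "foo=[[0,1],[2,3]]"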

  • Many improvements in the scheduler: improved schedules for multi-resource requests,
    better behavior on non-heterogeneous clusters,
    and better interaction between resources and priorities.

Automatic allocation

  • #467 You can now pause (and resume)
    autoalloc queues using hq alloc pause and hq alloc resume.
    Paused queues will not submit new allocations into the selected job manager; they can be resumed later.
    When an autoalloc queue hits too many submission or worker execution errors, it will now be paused
    instead of removed.
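
    For example (assuming queues are addressed by the ID shown in hq alloc list):

    $ hq alloc pause 1     # stop submitting new allocations for queue 1
    $ hq alloc resume 1    # resume the queue later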

Tasks

  • HQ now allows you to limit how many times a task may be in a running state when its worker is lost
    (such a task is a potential source of the worker's crash).
    If the limit is reached, the task is marked as failed.
    The limit can be configured with the --crash-limit flag of submit.
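
    For example (the flag name comes from these notes; the value is illustrative):

    $ hq submit --crash-limit=3 /my-program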

  • Groups of workers have been introduced. A multi-node task is now started only on workers from the same group.
    By default, workers are grouped by PBS/Slurm allocations, but this can be configured manually.
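
    For example, workers can be placed into the same group manually when they are started (a sketch; the
    --group flag name is an assumption based on this feature):

    $ hq worker start --group my-group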

Changes

Resource management

  • The --cpus=no-ht option has been changed to the flag --no-hyper-threading.
  • The explicit list definition of a resource was changed from --resource xxx=list(1,2,3) to --resource xxx=[1,2,3]
    (a result of the unification of CPUs with other resources).
  • Python API: the attribute generic in ResourceRequest has been renamed to resources.

Tasks

  • #461 When a task is cancelled, times out,
    or its worker is killed, HyperQueue now tries to make sure that both the task and any processes that
    it has spawned are also terminated.
  • #480 You can now select multiple tasks in hq task info.

Artifact summary:

  • hq-v0.13.0-*: Main HyperQueue build containing the hq binary. Download this archive to
    use HyperQueue from the command line.
  • hyperqueue-0.13.0-*: Wheel containing the hyperqueue package with HyperQueue Python
    bindings.

v0.13.0-rc1

02 Nov 15:18
Pre-release

HyperQueue 0.13.0-rc1

The release notes are identical to v0.13.0 above.

Artifact summary:

  • hq-v0.13.0-rc1-*: Main HyperQueue build containing the hq binary. Download this archive to
    use HyperQueue from the command line.
  • hyperqueue-0.13.0-rc1-*: Wheel containing the hyperqueue package with HyperQueue Python
    bindings.