Releases: It4innovations/hyperqueue
v0.17.0
HyperQueue 0.17.0
Breaking change
Memory resource in megabytes
- The automatically detected resource "mem", which is the size of a worker's RAM, now uses megabytes as its unit.
For example, --resource mem=100 now asks for 100 MiB (previously 100 bytes).
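Scripts that passed a byte count to the "mem" resource need their values converted. The following is a minimal sketch of that conversion, assuming the old value was a plain byte count and that one unit is now 1 MiB (1024 * 1024 bytes), as in the example above. The helper name bytes_to_mem_request is hypothetical and not part of HyperQueue.

```python
MIB = 1024 * 1024  # number of bytes in one MiB (assumed unit of "mem")


def bytes_to_mem_request(old_bytes: int) -> int:
    """Convert a byte count (old unit) into whole MiB (new unit), rounding up."""
    if old_bytes < 0:
        raise ValueError("memory amount must be non-negative")
    # Ceiling division, so that the converted request is never smaller
    # than the amount that was originally asked for.
    return -(-old_bytes // MIB)


# A request that used to be written as 104857600 (bytes) becomes 100.
print(bytes_to_mem_request(100 * MIB))
```

Rounding up is a choice made for this sketch; a script may prefer a different rounding.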
New features
Non-integer resource requests
- You may now ask for a non-integer amount of a resource, e.g. for 0.5 of a GPU.
This enables resource sharing on the logical level of the HyperQueue scheduler and allows other tasks to utilize
the remaining part of the resource.
Job submission
- You can now specify cleanup modes when passing stdout/stderr paths to tasks. Cleanup mode decides what should
happen with the file once the task has finished executing. Currently, a single cleanup mode is implemented, which removes
the file if the task has finished successfully:
$ hq submit --stdout=out.txt:rm-if-finished /my-program
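The behaviour described for rm-if-finished can be illustrated outside of HyperQueue. This is only a sketch of the semantics, assuming that "finished successfully" corresponds to a zero exit code; the function run_with_rm_if_finished is hypothetical and is not how HyperQueue implements the feature.

```python
import os
import subprocess
import sys
import tempfile


def run_with_rm_if_finished(cmd, stdout_path):
    """Run cmd with stdout redirected to stdout_path; delete the file on success."""
    with open(stdout_path, "wb") as out:
        result = subprocess.run(cmd, stdout=out)
    if result.returncode == 0:
        # The command succeeded, so its output file is removed.
        os.remove(stdout_path)
    return result.returncode


# Demonstration with two short Python child processes.
demo_dir = tempfile.mkdtemp()
ok_path = os.path.join(demo_dir, "ok.txt")
fail_path = os.path.join(demo_dir, "fail.txt")
run_with_rm_if_finished([sys.executable, "-c", "print('done')"], ok_path)
run_with_rm_if_finished(
    [sys.executable, "-c", "import sys; print('oops'); sys.exit(1)"], fail_path
)
# The output of the successful command is gone, the failed one is kept.
print(os.path.exists(ok_path), os.path.exists(fail_path))
```

Keeping the file only for failed tasks leaves the output available for debugging without accumulating files from successful runs.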
Fixes
- Fixed a crash when a task fails during its initialization
Artifact summary:
- hq-v0.17.0-*: Main HyperQueue build containing the hq binary. Download this archive to use HyperQueue from the command line.
- hyperqueue-0.17.0-*: Wheel containing the hyperqueue package with HyperQueue Python bindings.
v0.17.0-rc1
HyperQueue 0.17.0-rc1
Breaking change
Memory resource in megabytes
- The automatically detected resource "mem", which is the size of a worker's RAM, now uses megabytes as its unit.
For example, --resource mem=100 now asks for 100 MiB (previously 100 bytes).
New features
Non-integer resource requests
- You may now ask for a non-integer amount of a resource, e.g. for 0.5 of a GPU.
This enables resource sharing on the logical level of the HyperQueue scheduler and allows other tasks to utilize
the remaining part of the resource.
Job submission
- You can now specify cleanup modes when passing stdout/stderr paths to tasks. Cleanup mode decides what should
happen with the file once the task has finished executing. Currently, a single cleanup mode is implemented, which removes
the file if the task has finished successfully:
$ hq submit --stdout=out.txt:rm-if-finished /my-program
Fixes
- Fixed a crash when a task fails during its initialization
Artifact summary:
- hq-v0.17.0-rc1-*: Main HyperQueue build containing the hq binary. Download this archive to use HyperQueue from the command line.
- hyperqueue-0.17.0-rc1-*: Wheel containing the hyperqueue package with HyperQueue Python bindings.
v0.16.0
HyperQueue 0.16.0
New features
Pregenerating access files
- Via the command hq server generate-access you can pre-create an access file that can later be used for starting the server
and for connecting workers and clients. This is useful in cloud environments.
Job submission
- A new command hq job forget <job-selector> has been introduced. It can be used to completely forget a job, and thus
reduce the memory usage of the HQ server. It is especially useful if you submit a large number of jobs and keep the
server running for a long time.
Automatic allocation
- Autoalloc can now execute a custom shell command/script on each worker node before the worker starts and after the
worker stops. You can use this feature e.g. to initialize some data or load software modules for each worker node.
$ hq alloc add pbs --time-limit 30m \
    --worker-start-cmd "/project/xxx/init-node.sh" \
    --worker-stop-cmd "/project/xxx/cleanup-node.sh"
- You can now set a time limit for workers spawned in allocations with the --worker-time-limit flag. You can use this
flag to make workers stop sooner, e.g. to give more headroom for a --worker-stop-cmd command to execute
before the allocation is terminated. If you do not use this parameter, the worker time limit will be set to the time limit
of the allocation.
Example:
$ hq alloc add pbs --time-limit 1h --worker-time-limit 58m --worker-stop-cmd "/project/xxxx/slow-command.sh"
In this case, the allocation will run for one hour, but the HQ worker will be stopped after 58 minutes (unless it is
stopped sooner because of idle timeout). The worker stop command will thus have at least two minutes to execute.
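The headroom in this example is simply the difference between the two limits. A small sketch of that arithmetic follows; the function stop_cmd_headroom is a hypothetical helper for checking a planned configuration, not part of HyperQueue.

```python
from datetime import timedelta


def stop_cmd_headroom(time_limit: timedelta, worker_time_limit: timedelta) -> timedelta:
    """Return the minimum time left for the worker stop command.

    time_limit is the limit of the allocation, worker_time_limit is the
    limit of the worker running inside that allocation.
    """
    if worker_time_limit > time_limit:
        raise ValueError("worker time limit exceeds the allocation time limit")
    return time_limit - worker_time_limit


# Values from the example above: --time-limit 1h --worker-time-limit 58m
headroom = stop_cmd_headroom(timedelta(hours=1), timedelta(minutes=58))
print(headroom)  # 0:02:00
```

If the worker stops earlier, for instance because of an idle timeout, the stop command has more time than this minimum.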
Changes
Access file
The format of the access file has changed. It is mostly an internal change, but you may experience a parsing error when connecting
an old client/worker to a new server (connecting a new client/worker to an old server will give you a proper message).
Artifact summary:
- hq-v0.16.0-*: Main HyperQueue build containing the hq binary. Download this archive to use HyperQueue from the command line.
- hyperqueue-0.16.0-*: Wheel containing the hyperqueue package with HyperQueue Python bindings.
v0.16.0-rc1
HyperQueue 0.16.0-rc1
New features
Pregenerating access files
- Via the command hq server generate-access you can pre-create an access file that can later be used for starting the server
and for connecting workers and clients. This is useful in cloud environments.
Job submission
- A new command hq job forget <job-selector> has been introduced. It can be used to completely forget a job, and thus
reduce the memory usage of the HQ server. It is especially useful if you submit a large number of jobs and keep the
server running for a long time.
Automatic allocation
- Autoalloc can now execute a custom shell command/script on each worker node before the worker starts and after the
worker stops. You can use this feature e.g. to initialize some data or load software modules for each worker node.
$ hq alloc add pbs --time-limit 30m \
    --worker-start-cmd "/project/xxx/init-node.sh" \
    --worker-stop-cmd "/project/xxx/cleanup-node.sh"
- You can now set a time limit for workers spawned in allocations with the --worker-time-limit flag. You can use this
flag to make workers stop sooner, e.g. to give more headroom for a --worker-stop-cmd command to execute
before the allocation is terminated. If you do not use this parameter, the worker time limit will be set to the time limit
of the allocation.
Example:
$ hq alloc add pbs --time-limit 1h --worker-time-limit 58m --worker-stop-cmd "/project/xxxx/slow-command.sh"
In this case, the allocation will run for one hour, but the HQ worker will be stopped after 58 minutes (unless it is
stopped sooner because of idle timeout). The worker stop command will thus have at least two minutes to execute.
Changes
Access file
The format of the access file has changed. It is mostly an internal change, but you may experience a parsing error when connecting
an old client/worker to a new server (connecting a new client/worker to an old server will give you a proper message).
Artifact summary:
- hq-v0.16.0-rc1-*: Main HyperQueue build containing the hq binary. Download this archive to use HyperQueue from the command line.
- hyperqueue-0.16.0-rc1-*: Wheel containing the hyperqueue package with HyperQueue Python bindings.
v0.15.0
HyperQueue 0.15.0
Breaking changes
- NVIDIA GPUs are now automatically detected under the resource name gpus/nvidia, instead of just gpus!
If you have been using the gpus resource name, you should update your scripts.
See more details below.
New features
Resource management
- You can now specify more resources for one task, e.g. 1 cpu and 1 gpu OR 4 cpus. The scheduler considers both configurations in task planning.
For example, let us assume that we have many tasks with the mentioned configuration and a worker with 16 cpus and 4 gpus.
The tasks will fully utilize the node: 4 tasks will run in the configuration with a gpu and 3 tasks will run in the cpu-only mode.
- A Job Definition File is a TOML file that can define a job.
It allows submitting complex jobs without using the Python API (dependencies, resource variants, ...).
$ hq job submit-file myfile.toml
- You can now specify (indexed) resource values provided by workers as strings (previously only
integers were allowed). Notably, automatic detection of NVIDIA GPUs specified with string UUIDs
now works.
$ hq worker start --resource="res1=[foo, bar]"
- HyperQueue now provides built-in support for AMD GPUs. For this reason, the default name of GPU
resources that are automatically detected on a worker has been changed from gpus to gpus/nvidia
for NVIDIA GPUs. AMD GPUs are now autodetected as gpus/amd. In the future, we intend to create a way
to ask for any GPU resource (e.g. --resource=gpus=2), regardless of its type.
- AMD GPUs are now automatically detected in workers from the environment variable ROCR_VISIBLE_DEVICES.
- The allowed characters for resource names have been changed. The name now has to begin with an ASCII letter,
and it can only contain ASCII letters, ASCII digits and the slash (/) symbol. This restriction is
introduced for better alignment with shells, which typically do not support complicated variable names.
HQ passes the resource names to executed tasks through environment variables, so it has to take this
into account. Note that the / symbol in a resource name will be normalized to _ when being passed
to a task.
- hq task info now shows more information.
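Two points from the list above can be checked with a few lines of code: the resource name rule (including the normalization of / to _) and the arithmetic of the resource variant example. This is a sketch under stated assumptions: both function names are hypothetical, the regular expression is derived only from the rule as worded above, and the counting function merely reproduces the numbers of the example; it is not the scheduler's algorithm.

```python
import re
from typing import Tuple

# Rule as stated above: an ASCII letter first, then ASCII letters, digits or "/".
RESOURCE_NAME = re.compile(r"[A-Za-z][A-Za-z0-9/]*")


def normalize_resource_name(name: str) -> str:
    """Validate a resource name and return it with "/" replaced by "_"."""
    if not RESOURCE_NAME.fullmatch(name):
        raise ValueError(f"invalid resource name: {name!r}")
    return name.replace("/", "_")


def count_variant_tasks(cpus: int, gpus: int) -> Tuple[int, int]:
    """Count tasks for the variants "1 cpu and 1 gpu" OR "4 cpus" on one worker.

    The GPU variant is filled first and the leftover CPUs go to the
    CPU-only variant. Returns (gpu_variant_tasks, cpu_only_tasks).
    """
    gpu_tasks = min(cpus, gpus)          # each GPU task needs 1 cpu and 1 gpu
    cpu_tasks = (cpus - gpu_tasks) // 4  # each CPU-only task needs 4 cpus
    return gpu_tasks, cpu_tasks


print(normalize_resource_name("gpus/nvidia"))  # gpus_nvidia
print(count_variant_tasks(16, 4))              # (4, 3)
```

With 16 cpus and 4 gpus, the 4 GPU tasks take 4 cpus, and the remaining 12 cpus fit 3 CPU-only tasks, which matches the 4 + 3 split described above.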
Changes
Job submission
- The default path for stdout and stderr files has been changed from %{SUBMIT_DIR}/job-%{JOB_ID}/%{TASK_ID}.[stdout/stderr]
to %{CWD}/job-%{JOB_ID}/%{TASK_ID}.[stdout/stderr]. Note that the default value for the working
directory (%{CWD}) is set to the submission directory, so if you have used the defaults before,
nothing will change for you. Stdout and stderr paths are now also resolved relative to the working
directory of the given task, not to the submit directory.
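To make the new default concrete, the sketch below expands the placeholders of the default stdout template and resolves a relative path against a task's working directory. It only illustrates the rule described above; the helper names and the example paths are hypothetical, and the expansion is not HyperQueue's own implementation.

```python
import posixpath
import re

DEFAULT_STDOUT = "%{CWD}/job-%{JOB_ID}/%{TASK_ID}.stdout"


def expand_placeholders(template: str, values: dict) -> str:
    """Replace every %{NAME} in template with values["NAME"]."""
    return re.sub(r"%\{(\w+)\}", lambda m: str(values[m.group(1)]), template)


def resolve_output_path(cwd: str, path: str) -> str:
    """Resolve path relative to the task working directory cwd.

    An absolute path is returned unchanged (apart from normalization).
    """
    return posixpath.normpath(posixpath.join(cwd, path))


values = {"CWD": "/home/user/project", "JOB_ID": 5, "TASK_ID": 0}
print(expand_placeholders(DEFAULT_STDOUT, values))        # /home/user/project/job-5/0.stdout
print(resolve_output_path("/work/task", "logs/out.txt"))  # /work/task/logs/out.txt
```

When the working directory is left at its default (the submission directory), the expanded path is the same as it was with the old %{SUBMIT_DIR} template.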
Artifact summary:
- hq-v0.15.0-*: Main HyperQueue build containing the hq binary. Download this archive to use HyperQueue from the command line.
- hyperqueue-0.15.0-*: Wheel containing the hyperqueue package with HyperQueue Python bindings.
v0.15.0-rc1
HyperQueue 0.15.0-rc1
Breaking changes
- NVIDIA GPUs are now automatically detected under the resource name gpus/nvidia, instead of just gpus!
If you have been using the gpus resource name, you should update your scripts.
See more details below.
New features
Resource management
- You can now specify more resources for one task, e.g. 1 cpu and 1 gpu OR 4 cpus. The scheduler considers both configurations in task planning.
For example, let us assume that we have many tasks with the mentioned configuration and a worker with 16 cpus and 4 gpus.
The tasks will fully utilize the node: 4 tasks will run in the configuration with a gpu and 3 tasks will run in the cpu-only mode.
- A Job Definition File is a TOML file that can define a job.
It allows submitting complex jobs without using the Python API (dependencies, resource variants, ...).
$ hq job submit-file myfile.toml
- You can now specify (indexed) resource values provided by workers as strings (previously only
integers were allowed). Notably, automatic detection of NVIDIA GPUs specified with string UUIDs
now works.
$ hq worker start --resource="res1=[foo, bar]"
- HyperQueue now provides built-in support for AMD GPUs. For this reason, the default name of GPU
resources that are automatically detected on a worker has been changed from gpus to gpus/nvidia
for NVIDIA GPUs. AMD GPUs are now autodetected as gpus/amd. In the future, we intend to create a way
to ask for any GPU resource (e.g. --resource=gpus=2), regardless of its type.
- AMD GPUs are now automatically detected in workers from the environment variable ROCR_VISIBLE_DEVICES.
- The allowed characters for resource names have been changed. The name now has to begin with an ASCII letter,
and it can only contain ASCII letters, ASCII digits and the slash (/) symbol. This restriction is
introduced for better alignment with shells, which typically do not support complicated variable names.
HQ passes the resource names to executed tasks through environment variables, so it has to take this
into account. Note that the / symbol in a resource name will be normalized to _ when being passed
to a task.
- hq task info now shows more information.
Changes
Job submission
- The default path for stdout and stderr files has been changed from %{SUBMIT_DIR}/job-%{JOB_ID}/%{TASK_ID}.[stdout/stderr]
to %{CWD}/job-%{JOB_ID}/%{TASK_ID}.[stdout/stderr]. Note that the default value for the working
directory (%{CWD}) is set to the submission directory, so if you have used the defaults before,
nothing will change for you. Stdout and stderr paths are now also resolved relative to the working
directory of the given task, not to the submit directory.
Artifact summary:
- hq-v0.15.0-rc1-*: Main HyperQueue build containing the hq binary. Download this archive to use HyperQueue from the command line.
- hyperqueue-0.15.0-rc1-*: Wheel containing the hyperqueue package with HyperQueue Python bindings.
v0.14.0
HyperQueue 0.14.0
New features
CLI
- #545 Add a new command hq job summary, which displays the number of jobs in each job state.
Platforms
- HQ can now be compiled for Raspberry Pi
Fixes
Worker
- #539 Fix connection of worker to server in the presence of both IPv4 and IPv6 addresses.
Job submission
- #540 Parse all arguments from shebang in a directives file (e.g. #!/bin/bash -l).
Streaming
- Fixed a bug in closing streaming when tasks are very short and synchronized.
Artifact summary:
- hq-v0.14.0-*: Main HyperQueue build containing the hq binary. Download this archive to use HyperQueue from the command line.
- hyperqueue-0.14.0-*: Wheel containing the hyperqueue package with HyperQueue Python bindings.
v0.14.0-rc1
HyperQueue 0.14.0-rc1
New features
CLI
- #545 Add a new command hq job summary, which displays the number of jobs in each job state.
Platforms
- HQ can now be compiled for Raspberry Pi
Fixes
Worker
- #539 Fix connection of worker to server in the presence of both IPv4 and IPv6 addresses.
Job submission
- #540 Parse all arguments from shebang in a directives file (e.g. #!/bin/bash -l).
Streaming
- Fixed a bug in closing streaming when tasks are very short and synchronized.
Artifact summary:
- hq-v0.14.0-rc1-*: Main HyperQueue build containing the hq binary. Download this archive to use HyperQueue from the command line.
- hyperqueue-0.14.0-rc1-*: Wheel containing the hyperqueue package with HyperQueue Python bindings.
v0.13.0
HyperQueue 0.13.0
New features
Resource management
- Almost complete rewrite of resource management.
CPU and other resources were unified: the most visible change is that you can define "cpus" and other resources;
and other resources can now be defined in groups (NUMA-like resources).
- Many improvements in the scheduler: improved schedules for multi-resource requests;
better behavior on non-heterogeneous clusters;
better interaction between resources and priorities.
Automatic allocation
- #467 You can now pause (and resume) autoalloc queues using hq alloc pause and hq alloc resume.
Paused queues will not submit new allocations into the selected job manager. They can be later resumed.
When an autoalloc queue hits too many submission or worker execution errors, it will now be paused
instead of removed.
Tasks
- HQ allows limiting how many times a task may be in a running state while its worker is lost
(such a task may be a potential source of the worker's crash).
If the limit is reached, the task is marked as failed.
The limit can be configured by --crash-limit in submit.
- Groups of workers are introduced. A multi-node task is now started only on workers from the same group.
By default, workers are grouped by PBS/Slurm allocations, but it can be configured manually.
Changes
Resource management
- --cpus=no-ht is now changed to a flag --no-hyper-threading.
- Explicit list definition of a resource was changed from --resource xxx=list(1,2,3) to --resource xxx=[1,2,3]
(this is the result of unification of CPUs with other resources).
- Python API: Attribute generic in ResourceRequest is renamed to resources
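Existing scripts that still use the old list(...) form can be rewritten mechanically. The following is a minimal sketch, assuming the argument has exactly the shape shown above (name=list(...) without nested parentheses); migrate_resource_arg is a hypothetical helper, not a HyperQueue tool.

```python
import re

# Matches "=list(<items>)" where <items> contains no parentheses.
_OLD_LIST = re.compile(r"=list\(([^()]*)\)")


def migrate_resource_arg(arg: str) -> str:
    """Rewrite name=list(a,b,c) to name=[a,b,c]; other arguments are unchanged."""
    return _OLD_LIST.sub(r"=[\1]", arg)


print(migrate_resource_arg("xxx=list(1,2,3)"))  # xxx=[1,2,3]
```

Arguments that are already in the new form pass through unchanged, so the helper can be applied to a whole script more than once.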
Tasks
- #461 When a task is cancelled, times out or its worker is killed, HyperQueue now tries to make sure that both the task
and any processes that it has spawned will also be terminated.
- #480 You can now select multiple tasks in hq task info.
Artifact summary:
- hq-v0.13.0-*: Main HyperQueue build containing the hq binary. Download this archive to use HyperQueue from the command line.
- hyperqueue-0.13.0-*: Wheel containing the hyperqueue package with HyperQueue Python bindings.
v0.13.0-rc1
HyperQueue 0.13.0-rc1
New features
Resource management
- Almost complete rewrite of resource management.
CPU and other resources were unified: the most visible change is that you can define "cpus" and other resources;
and other resources can now be defined in groups (NUMA-like resources).
- Many improvements in the scheduler: improved schedules for multi-resource requests;
better behavior on non-heterogeneous clusters;
better interaction between resources and priorities.
Automatic allocation
- #467 You can now pause (and resume) autoalloc queues using hq alloc pause and hq alloc resume.
Paused queues will not submit new allocations into the selected job manager. They can be later resumed.
When an autoalloc queue hits too many submission or worker execution errors, it will now be paused
instead of removed.
Tasks
- HQ allows limiting how many times a task may be in a running state while its worker is lost
(such a task may be a potential source of the worker's crash).
If the limit is reached, the task is marked as failed.
The limit can be configured by --crash-limit in submit.
- Groups of workers are introduced. A multi-node task is now started only on workers from the same group.
By default, workers are grouped by PBS/Slurm allocations, but it can be configured manually.
Changes
Resource management
- --cpus=no-ht is now changed to a flag --no-hyper-threading.
- Explicit list definition of a resource was changed from --resource xxx=list(1,2,3) to --resource xxx=[1,2,3]
(this is the result of unification of CPUs with other resources).
- Python API: Attribute generic in ResourceRequest is renamed to resources
Tasks
- #461 When a task is cancelled, times out or its worker is killed, HyperQueue now tries to make sure that both the task
and any processes that it has spawned will also be terminated.
- #480 You can now select multiple tasks in hq task info.
Artifact summary:
- hq-v0.13.0-rc1-*: Main HyperQueue build containing the hq binary. Download this archive to use HyperQueue from the command line.
- hyperqueue-0.13.0-rc1-*: Wheel containing the hyperqueue package with HyperQueue Python bindings.