Releases: It4innovations/hyperqueue
v0.12.0-rc1
HyperQueue 0.12.0-rc1
New features
Automatic allocation
- #457 You can now specify the idle timeout for workers started by the automatic allocator using the `--idle-timeout` flag of the `hq alloc add` command.
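For example, a PBS allocation queue whose workers shut down after ten minutes of inactivity might be added as sketched below; the queue name, project and time values are placeholders, and the exact duration syntax accepted by the flag may differ:

```console
$ hq alloc add pbs --time-limit 1h --idle-timeout 10m -- -q qprod -A PROJECT-ID
```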
Resiliency
- #449 Tasks that were present during multiple worker crashes will now be canceled.
CLI
- #463 You can now wait until `N` workers are connected to the cluster using `hq worker wait N`.
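For instance, to block until at least four workers are available before submitting work (a minimal sketch; the submitted script is a placeholder):

```console
$ hq worker wait 4
$ hq submit --array 1-100 -- ./my-script.sh
```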
Python API
- Improvements to resource requests in the Python API.
Changes
CLI
- #477 Requested resources are now shown when submitting an array job and when viewing information about task `TASK_ID` of job `JOB_ID` using `hq task info JOB_ID TASK_ID` (see the example below).
- #444 The `hq task list` command now hides some details by default to conserve space in terminal output. To show all details, use the `-v` flag to enable verbose output.
- #455 Improved the quality of error messages produced when parsing various CLI parameters, such as resources.
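A quick illustration of the verbose task listing and the per-task detail view; the job and task IDs are placeholders:

```console
$ hq task list 1 -v
$ hq task info 1 3
```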
Automatic allocation
- #448 The automatic allocator will now start workers in multi-node Slurm allocations using `srun --overlap`. This should avoid the started workers taking up Slurm task resources (if possible). If you run into any issues with using `srun` inside HyperQueue tasks, please let us know.
Jobs
- #483 There is no longer a length limit
for job names.
Fixes
Job submission
- #450 Attempts to resubmit a job with zero
tasks will now result in an explicit error, rather than a crash of the client.
Artifact summary:
- hq-v0.12.0-rc1-*: Main HyperQueue build containing the `hq` binary. Download this archive to use HyperQueue from the command line.
- hyperqueue-0.12.0-rc1-*: Wheel containing the `hyperqueue` package with HyperQueue Python bindings.
v0.11.0-ligate1
HyperQueue 0.11.0-ligate1
New features
CLI
- #423 You can now specify the server directory using the `HQ_SERVER_DIR` environment variable.
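A minimal sketch of pointing the server and subsequent commands at a shared server directory (the path is a placeholder):

```console
$ export HQ_SERVER_DIR=/projects/my-project/.hq-server
$ hq server start &
$ hq submit -- hostname
```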
Resource management
- #427 A new specifier has been added for defining indexed pool resources of workers as a set of individual resource indices:

  ```console
  $ hq worker start --resource "gpus=list(1,3,8)"
  ```

- #428 Workers will now attempt to automatically detect available GPU resources from the `CUDA_VISIBLE_DEVICES` environment variable.
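For instance, when the workload manager exports `CUDA_VISIBLE_DEVICES`, starting a worker without an explicit `--resource` flag should pick up the GPUs automatically; the sketch below assumes the detected resource is registered under the `gpus` name used above:

```console
$ CUDA_VISIBLE_DEVICES=0,1 hq worker start
# roughly equivalent to the explicit specification:
$ hq worker start --resource "gpus=list(0,1)"
```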
Stream log
- Basic export of stream log into JSON (`hq log <log_file> export`).
Server
- Improved scheduling of multi-node tasks.
- The server now generates a random unique ID (UID) string every time a new server is started (`hq server start`). It can be used via the `%{SERVER_ID}` placeholder.
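A sketch of using that placeholder when redirecting task output, so that runs under different server instances do not overwrite each other (the placeholder name follows the note above; the path layout and script are illustrative):

```console
$ hq submit --array 1-10 --stdout "output/%{SERVER_ID}/%{TASK_ID}.stdout" -- ./work.sh
```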
Changes
CLI
- #433 (Backwards incompatible change) The CLI command `hq job tasks` has been removed and its functionality has been incorporated into the `hq task list` command instead.
- #420 The shebang (e.g. `#!/bin/bash`) will now be read from the submitted program based on the provided directives mode. If a shebang is found, HQ will execute the program located at the shebang path and pass it the rest of the submitted arguments.

  By default, directives and the shebang will be read from the submitted program only if its filename ends with `.sh`. If you want to explicitly enable reading the shebang, pass `--directives=file` to `hq submit`.

  Another change is that the shebang is now read by the client (i.e. it will be read on the node that submits the job), not on worker nodes as previously. This means that the submitted file has to be accessible on the client node.
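A minimal sketch of the new behaviour for a script whose filename does not end with `.sh`, so that reading of directives and the shebang has to be enabled explicitly (the script contents, including the `#HQ` directive line, are shown only for illustration):

```console
$ cat run-task
#!/bin/bash
#HQ --cpus=2
echo "running on $(hostname)"
$ hq submit --directives=file ./run-task
```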
Resource management
- #427 (Backwards incompatible change) The environment variable `HQ_RESOURCE_INDICES_<resource-name>`, which is passed to tasks with resource requests, has been renamed to `HQ_RESOURCE_VALUES_<resource-name>` (see the sketch after this list).
- #427 (Backwards incompatible change) The specifier for defining indexed pool resources of workers as a range has been renamed from `indices` to `range`:

  ```console
  # before
  $ hq worker start --resource "gpus=indices(1-3)"
  # now
  $ hq worker start --resource "gpus=range(1-3)"
  ```

- #427 The generic resource documentation has been rewritten and improved.
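A sketch of reading the renamed variable inside a task, assuming a generic resource named `gpus` is configured on the workers and requested with the `--resource` submit option described in the generic resource documentation:

```console
$ hq submit --resource gpus=2 -- bash -c 'echo "assigned gpus: $HQ_RESOURCE_VALUES_gpus"'
```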
Artifact summary:
- hq-v0.11.0-ligate1-*: Main HyperQueue build containing the `hq` binary. Download this archive to use HyperQueue from the command line.
- hyperqueue-0.11.0-ligate1-*: Wheel containing the `hyperqueue` package with HyperQueue Python bindings.
v0.11.0
HyperQueue 0.11.0
New features
CLI
- #423 You can now specify the server directory using the `HQ_SERVER_DIR` environment variable.
Resource management
- #427 A new specifier has been added for defining indexed pool resources of workers as a set of individual resource indices:

  ```console
  $ hq worker start --resource "gpus=list(1,3,8)"
  ```

- #428 Workers will now attempt to automatically detect available GPU resources from the `CUDA_VISIBLE_DEVICES` environment variable.
Stream log
- Basic export of stream log into JSON (`hq log <log_file> export`).
Server
- Improved scheduling of multi-node tasks.
- The server now generates a random unique ID (UID) string every time a new server is started (`hq server start`). It can be used via the `%{SERVER_ID}` placeholder.
Changes
CLI
- #433 (Backwards incompatible change) The CLI command `hq job tasks` has been removed and its functionality has been incorporated into the `hq task list` command instead.
- #420 The shebang (e.g. `#!/bin/bash`) will now be read from the submitted program based on the provided directives mode. If a shebang is found, HQ will execute the program located at the shebang path and pass it the rest of the submitted arguments.

  By default, directives and the shebang will be read from the submitted program only if its filename ends with `.sh`. If you want to explicitly enable reading the shebang, pass `--directives=file` to `hq submit`.

  Another change is that the shebang is now read by the client (i.e. it will be read on the node that submits the job), not on worker nodes as previously. This means that the submitted file has to be accessible on the client node.
Resource management
- #427 (Backwards incompatible change) The environment variable `HQ_RESOURCE_INDICES_<resource-name>`, which is passed to tasks with resource requests, has been renamed to `HQ_RESOURCE_VALUES_<resource-name>`.
- #427 (Backwards incompatible change) The specifier for defining indexed pool resources of workers as a range has been renamed from `indices` to `range`:

  ```console
  # before
  $ hq worker start --resource "gpus=indices(1-3)"
  # now
  $ hq worker start --resource "gpus=range(1-3)"
  ```

- #427 The generic resource documentation has been rewritten and improved.
Artifact summary:
- hq-v0.11.0-*: Main HyperQueue build containing the `hq` binary. Download this archive to use HyperQueue from the command line.
- hyperqueue-0.11.0-*: Wheel containing the `hyperqueue` package with HyperQueue Python bindings.
v0.11.0-rc1
HyperQueue 0.11.0-rc1
New features
CLI
- #423 You can now specify the server directory using the `HQ_SERVER_DIR` environment variable.
Resource management
- #427 A new specifier has been added for defining indexed pool resources of workers as a set of individual resource indices:

  ```console
  $ hq worker start --resource "gpus=list(1,3,8)"
  ```

- #428 Workers will now attempt to automatically detect available GPU resources from the `CUDA_VISIBLE_DEVICES` environment variable.
Stream log
- Basic export of stream log into JSON (`hq log <log_file> export`).
Server
- Improved scheduling of multi-node tasks.
- The server now generates a random unique ID (UID) string every time a new server is started (`hq server start`). It can be used via the `%{SERVER_ID}` placeholder.
Changes
CLI
- #433 (Backwards incompatible change) The CLI command `hq job tasks` has been removed and its functionality has been incorporated into the `hq task list` command instead.
- #420 The shebang (e.g. `#!/bin/bash`) will now be read from the submitted program based on the provided directives mode. If a shebang is found, HQ will execute the program located at the shebang path and pass it the rest of the submitted arguments.

  By default, directives and the shebang will be read from the submitted program only if its filename ends with `.sh`. If you want to explicitly enable reading the shebang, pass `--directives=file` to `hq submit`.

  Another change is that the shebang is now read by the client (i.e. it will be read on the node that submits the job), not on worker nodes as previously. This means that the submitted file has to be accessible on the client node.
Resource management
- #427 (Backwards incompatible change) The environment variable `HQ_RESOURCE_INDICES_<resource-name>`, which is passed to tasks with resource requests, has been renamed to `HQ_RESOURCE_VALUES_<resource-name>`.
- #427 (Backwards incompatible change) The specifier for defining indexed pool resources of workers as a range has been renamed from `indices` to `range`:

  ```console
  # before
  $ hq worker start --resource "gpus=indices(1-3)"
  # now
  $ hq worker start --resource "gpus=range(1-3)"
  ```

- #427 The generic resource documentation has been rewritten and improved.
Artifact summary:
- hq-v0.11.0-rc1-*: Main HyperQueue build containing the `hq` binary. Download this archive to use HyperQueue from the command line.
- hyperqueue-0.11.0-rc1-*: Wheel containing the `hyperqueue` package with HyperQueue Python bindings.
v0.10.0
HyperQueue 0.10.0
New features
Running tasks
- HQ will now set the OpenMP `OMP_NUM_THREADS` environment variable for each task. The number of threads will be set according to the number of requested cores. For example, the job submission `hq submit --cpus=4 -- <program>` would pass `OMP_NUM_THREADS=4` to the executed `<program>`.
- A new OpenMP task pinning mode has been added. You can now use `--pin=omp` when submitting jobs. This CPU pin mode will generate the corresponding `OMP_PLACES` and `OMP_PROC_BIND` environment variables to make sure that OpenMP pins its threads to the exact cores allocated by HyperQueue.
- Preview version of multi-node tasks. You may submit a multi-node task with `hq submit --nodes=X ...`
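For example, an OpenMP program requesting four pinned cores, and a preview multi-node task spanning two nodes, might be submitted as sketched below (the program names are placeholders):

```console
# OpenMP pinning with four requested cores
$ hq submit --cpus=4 --pin=omp -- ./omp-program
# preview multi-node task on two nodes
$ hq submit --nodes=2 -- ./multi-node-program
```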
CLI
- Less verbose log output by default. You can use `--debug` to turn on the old behaviour.
Changes
Scheduler
- When there are only a few tasks, the scheduler tries to fit them onto fewer workers. The goal is to enable earlier stopping of workers due to the idle timeout.
CLI
- The `--pin` boolean option for submitting jobs has been changed to take a value. You can get the original behaviour by specifying `--pin=taskset`.
Fixes
Automatic allocation
- PBS/Slurm allocations using multiple workers will now correctly spawn a HyperQueue worker on all
allocated nodes.
Artifact summary:
- hq-v0.10.0-*: Main HyperQueue build containing the `hq` binary. Download this archive to use HyperQueue from the command line.
- hyperqueue-0.10.0-*: Wheel containing the `hyperqueue` package with HyperQueue Python bindings.
v0.10.0-rc1
HyperQueue 0.10.0-rc1
New features
Running tasks
- HQ will now set the OpenMP `OMP_NUM_THREADS` environment variable for each task. The number of threads will be set according to the number of requested cores. For example, the job submission `hq submit --cpus=4 -- <program>` would pass `OMP_NUM_THREADS=4` to the executed `<program>`.
- A new OpenMP task pinning mode has been added. You can now use `--pin=omp` when submitting jobs. This CPU pin mode will generate the corresponding `OMP_PLACES` and `OMP_PROC_BIND` environment variables to make sure that OpenMP pins its threads to the exact cores allocated by HyperQueue.
- Preview version of multi-node tasks. You may submit a multi-node task with `hq submit --nodes=X ...`
CLI
- Less verbose log output by default. You can use `--debug` to turn on the old behaviour.
Changes
Scheduler
- When there are only a few tasks, the scheduler tries to fit them onto fewer workers. The goal is to enable earlier stopping of workers due to the idle timeout.
CLI
- The `--pin` boolean option for submitting jobs has been changed to take a value. You can get the original behaviour by specifying `--pin=taskset`.
Fixes
Automatic allocation
- PBS/Slurm allocations using multiple workers will now correctly spawn a HyperQueue worker on all
allocated nodes.
Artifact summary:
- hq-v0.10.0-rc1-*: Main HyperQueue build containing the `hq` binary. Download this archive to use HyperQueue from the command line.
- hyperqueue-0.10.0-rc1-*: Wheel containing the `hyperqueue` package with HyperQueue Python bindings.
v0.9.0
HyperQueue 0.9.0
New features
Tasks
- A task may be started with a temporary directory that is automatically deleted when the task is finished (flag `--task-dir`).
- A task may provide its own error message by creating a file whose name is passed via the `HQ_ERROR_FILENAME` environment variable.
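A sketch of a task that runs with a temporary task directory and reports a custom error message through the error file; the script body is purely illustrative:

```console
$ hq submit --task-dir -- bash -c 'echo "input file missing" > "$HQ_ERROR_FILENAME"; exit 1'
```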
CLI
- You can now use the `hq task list <job-selector>` command to display a list of tasks across multiple jobs.
- Added a `--filter` flag to `worker list` to allow filtering workers by their status.
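For instance (the job selector and the filter value below are illustrative; consult `hq task list --help` and `hq worker list --help` for the exact accepted forms):

```console
$ hq task list 1-3
$ hq worker list --filter running
```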
Changes
Automatic allocation
- Automatic allocation has been rewritten from scratch. It will no longer query PBS/Slurm allocation statuses periodically; instead, it will try to derive the allocation state from the workers that connect to it from allocations.
- When adding a new allocation queue, HyperQueue will now try to immediately submit a job into the queue to quickly test whether the entered configuration is correct. If you want to avoid this behaviour, you can use the `--no-dry-run` flag for `hq alloc add <pbs/slurm>` (see the sketch after this list).
- If too many submissions (10) or running allocations (3) fail in succession, the corresponding allocation queue will be automatically removed to avoid error loops.
- The `hq alloc events` command has been removed.
- The `--max-kept-directories` parameter for allocation queues has been removed. HyperQueue will now keep the last 20 allocation directories amongst all allocation queues.
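For example, adding a Slurm allocation queue without the initial dry-run submission might look like this (the partition, account and time limit are placeholders):

```console
$ hq alloc add slurm --no-dry-run --time-limit 1h -- --partition=compute --account=PROJECT-ID
```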
Fixes
- HQ will no longer warn that the `stdout`/`stderr` path does not contain the `%{TASK_ID}` placeholder when submitting array jobs, if the placeholder is contained within the working directory path and `stdout`/`stderr` contains the `%{CWD}` placeholder.
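As an illustration of a path combination that no longer triggers the warning (a sketch; the flags and paths are illustrative):

```console
$ hq submit --array 1-10 --cwd "task-%{TASK_ID}" --stdout "%{CWD}/stdout" -- ./work.sh
```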
v0.9.0-rc3
HyperQueue 0.9.0-rc3
New features
Tasks
- A task may be started with a temporary directory that is automatically deleted when the task is finished (flag `--task-dir`).
- A task may provide its own error message by creating a file whose name is passed via the `HQ_ERROR_FILENAME` environment variable.
CLI
- You can now use the `hq task list <job-selector>` command to display a list of tasks across multiple jobs.
- Added a `--filter` flag to `worker list` to allow filtering workers by their status.
Changes
Automatic allocation
- Automatic allocation has been rewritten from scratch. It will no longer query PBS/Slurm allocation statuses periodically; instead, it will try to derive the allocation state from the workers that connect to it from allocations.
- When adding a new allocation queue, HyperQueue will now try to immediately submit a job into the queue to quickly test whether the entered configuration is correct. If you want to avoid this behaviour, you can use the `--no-dry-run` flag for `hq alloc add <pbs/slurm>`.
- If too many submissions (10) or running allocations (3) fail in succession, the corresponding allocation queue will be automatically removed to avoid error loops.
- The `hq alloc events` command has been removed.
- The `--max-kept-directories` parameter for allocation queues has been removed. HyperQueue will now keep the last 20 allocation directories amongst all allocation queues.
Fixes
- HQ will no longer warn that the `stdout`/`stderr` path does not contain the `%{TASK_ID}` placeholder when submitting array jobs, if the placeholder is contained within the working directory path and `stdout`/`stderr` contains the `%{CWD}` placeholder.
v0.9.0-rc2
HyperQueue 0.9.0-rc2
New features
Tasks
- A task may be started with a temporary directory that is automatically deleted when the task is finished (flag `--task-dir`).
CLI
- You can now use the `hq task list <job-selector>` command to display a list of tasks across multiple jobs.
- Added a `--filter` flag to `worker list` to allow filtering workers by their status.
Changes
Automatic allocation
- When adding a new allocation queue, HyperQueue will now try to immediately submit a job into the queue to quickly test whether the entered configuration is correct. If you want to avoid this behaviour, you can use the `--no-dry-run` flag for `hq alloc add <pbs/slurm>`.
- The automatic allocator will now be invoked much less frequently, which should reduce the stress put on the used HPC job manager (e.g. PBS). You might thus see up to 10-minute delays before the HQ allocation list displays updated information or before a new allocation is submitted. We plan to rework the automatic allocator in future versions to allow more frequent updates while avoiding generating too many requests to the HPC job manager.
Fixes
- HQ will no longer warn that the `stdout`/`stderr` path does not contain the `%{TASK_ID}` placeholder when submitting array jobs, if the placeholder is contained within the working directory path and `stdout`/`stderr` contains the `%{CWD}` placeholder.
- The automatic allocator will query PBS allocation statuses less often. It will now ask for the status of all allocations per allocation queue in a single `qstat` call, and it now also contains a backoff that slows down new allocations if there are submission errors. If too many submissions (10) or running allocations (3) fail in succession, the corresponding allocation queue will be automatically removed.
v0.9.0-rc1
HyperQueue 0.9.0-rc1
New features
Tasks
- A task may be started with a temporary directory that is automatically deleted when the task is finished (flag `--task-dir`).
CLI
- You can now use the `hq task list <job-selector>` command to display a list of tasks across multiple jobs.
- Added a `--filter` flag to `worker list` to allow filtering workers by their status.
Changes
Automatic allocation
- When adding a new allocation queue, HyperQueue will now try to immediately submit a job into the queue to quickly test whether the entered configuration is correct. If you want to avoid this behaviour, you can use the `--no-dry-run` flag for `hq alloc add <pbs/slurm>`.
Fixes
- HQ will no longer warn that the `stdout`/`stderr` path does not contain the `%{TASK_ID}` placeholder when submitting array jobs, if the placeholder is contained within the working directory path and `stdout`/`stderr` contains the `%{CWD}` placeholder.
- The automatic allocator will query PBS allocation statuses less often. It will now ask for the status of all allocations per allocation queue in a single `qstat` call, and it now also contains a backoff that slows down new allocations if there are submission errors. If too many submissions (50) or allocations (10) fail in succession, the corresponding allocation queue will be automatically removed.