Skip to content

Add support for asynchronous workers #149

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open

Conversation

nefrathenrici
Copy link
Member

@nefrathenrici nefrathenrici commented Mar 13, 2025

Purpose

This PR enables workers to be added while a calibration is already underway.

To-do

  • Add redirect_stderr(stdout) to worker logger func

Content

  • Add an expr field to PBSManager and SlurmManager. This expr is evaluated just after the worker is initialized. This can be set via SlurmManager(nprocs; expr). This enables workers to be initialized and load code asynchronously, without relying on @everywhere.
  • Add method specialization for Distributed.create_worker(manager::Union{SlurmManager, PBSManager},...) in order to ensure the expression gets evaluated before any other code.
  • Changed run_worker_iteration to check for newly added workers for each forward model run. The function now gets initial workers, then makes a vector of work (fwd model calls). While this vector is not empty, we check for new workers, and if workers are available we assign one unit of work to the worker. If no workers are available, wait for workers

Previous workflow:

addprocs(SlurmManager(nprocs)...)
@everywhere
    include(model_interface_file)
end

calibrate(WorkerBackend, ...)

New workflow:

expr = quote
    include(model_interface_file)
end
@async addprocs(SlurmManager(nprocs; expr)...)

calibrate(WorkerBackend, ...)

  • I have read and checked the items on the review checklist.

@nefrathenrici nefrathenrici force-pushed the ne/async_workers branch 4 times, most recently from f5715f7 to 6dba93a Compare March 13, 2025 23:35
@nefrathenrici nefrathenrici linked an issue Mar 14, 2025 that may be closed by this pull request
@nefrathenrici nefrathenrici requested a review from Sbozzolo March 14, 2025 16:28
@nefrathenrici nefrathenrici force-pushed the ne/async_workers branch 2 times, most recently from 7b8d0ce to 9c69284 Compare March 14, 2025 18:35
@@ -406,6 +456,7 @@ This function should be called from the worker process.
"""
function set_worker_logger()
@eval Main using Logging
redirect_stderr(stdout)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is incorrect and should be redirect_stderr(io)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ensure slurm/pbs array jobs are caught asynchronously
1 participant