Add support for asynchronous workers #149

nefrathenrici · 2025-03-13T22:48:38Z

Purpose

This PR enables workers to be added while a calibration is already underway.

To-do

Add redirect_stderr(stdout) to worker logger func

Content

Add an expr field to PBSManager and SlurmManager. The expression is evaluated just after the worker is initialized and can be set via SlurmManager(nprocs; expr). This lets workers initialize and load code asynchronously, without relying on @everywhere.
Add a method specialization for Distributed.create_worker(manager::Union{SlurmManager, PBSManager},...) to ensure the expression is evaluated before any other code.
Change run_worker_iteration to check for newly added workers for each forward model run. The function now gets the initial workers, then builds a vector of work (forward model calls). While this vector is not empty, it checks for new workers and, if a worker is available, assigns it one unit of work. If no workers are available, it waits for workers.
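
The scheduling loop described in the last item can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: run_with_dynamic_workers, its poll_interval keyword, and the bookkeeping dictionary are hypothetical names, and worker failures are not handled. It only uses standard Distributed calls (workers, remotecall, isready, fetch). Note that when no workers have been added, workers() returns the master process, so the sketch then runs the work locally.

```julia
using Distributed

# Hypothetical helper: apply `f` to every element of `work`, re-querying the
# worker pool on each pass so that workers added mid-run receive work too.
function run_with_dynamic_workers(f, work; poll_interval = 0.05)
    pending = collect(enumerate(work))            # (index, input) pairs not yet assigned
    results = Vector{Any}(undef, length(pending))
    in_flight = Dict{Int, Tuple{Int, Future}}()   # worker id => (index, future)

    while !isempty(pending) || !isempty(in_flight)
        # Harvest finished runs, which frees their workers.
        for (pid, (idx, fut)) in collect(in_flight)
            if isready(fut)
                results[idx] = fetch(fut)
                delete!(in_flight, pid)
            end
        end

        # Query the pool again so late-joining workers are seen.
        idle = [pid for pid in workers() if !haskey(in_flight, pid)]

        # Assign one unit of work to each idle worker.
        while !isempty(pending) && !isempty(idle)
            idx, x = popfirst!(pending)
            pid = pop!(idle)
            in_flight[pid] = (idx, remotecall(f, pid, x))
        end

        # Wait before polling for finished work and new workers again.
        sleep(poll_interval)
    end
    return results
end
```

Polling keeps the sketch short; an event-driven design (for example, one task per worker pulling from a shared Channel) would avoid the sleep at the cost of more code.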

Previous workflow:

addprocs(SlurmManager(nprocs)...)
@everywhere begin
    include(model_interface_file)
end

calibrate(WorkerBackend, ...)

New workflow:

expr = quote
    include(model_interface_file)
end
@async addprocs(SlurmManager(nprocs; expr)...)

calibrate(WorkerBackend, ...)
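
To make the role of expr concrete, the sketch below evaluates a quoted expression in Main on a given process using only standard Distributed calls. It is an illustration under assumptions, not this PR's mechanism: the PR evaluates expr inside its Distributed.create_worker specialization, whereas init_worker and forward_model_stub here are hypothetical names, and the example targets the current process so that it runs without a cluster.

```julia
using Distributed

# Hypothetical helper: evaluate `expr` in `Main` on process `pid` and block
# until the evaluation has finished.
function init_worker(pid, expr)
    remotecall_fetch(Core.eval, pid, Main, expr)
    return pid
end

expr = quote
    # Stand-in for `include(model_interface_file)`.
    forward_model_stub(x) = 2x
end

# Evaluate on the current process; with real workers, pass a worker id instead.
init_worker(myid(), expr)
```

Because the expression is evaluated per worker rather than broadcast with @everywhere, a worker that joins late can be initialized on its own without touching workers that are already running.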

I have read and checked the items on the review checklist.

nefrathenrici · 2025-03-19T18:53:27Z

src/workers.jl

@@ -406,6 +456,7 @@ This function should be called from the worker process.
 """
 function set_worker_logger()
    @eval Main using Logging
+    redirect_stderr(stdout)


This is incorrect; it should be redirect_stderr(io).
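
As a rough illustration of the suggested fix, the sketch below sends both stdout and stderr to the same log file handle. It assumes io is an open, writable file stream, which the diff excerpt does not show; with_worker_log is a hypothetical helper and is not part of src/workers.jl.

```julia
# Hypothetical helper: run `f` with stdout and stderr both redirected to the
# file at `path`, restoring the original streams afterwards.
function with_worker_log(f, path)
    open(path, "w") do io
        redirect_stdout(io) do
            redirect_stderr(io) do
                f()
            end
        end
    end
end
```

Redirecting stderr to the same io as stdout keeps warnings and errors in the worker's log file, whereas redirect_stderr(stdout) depends on what stdout points to at the time of the call.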

nefrathenrici force-pushed the ne/async_workers branch 4 times, most recently from f5715f7 to 6dba93a Compare March 13, 2025 23:35

nefrathenrici linked an issue Mar 14, 2025 that may be closed by this pull request

Ensure slurm/pbs array jobs are caught asynchronously #145

Open

nefrathenrici requested a review from Sbozzolo March 14, 2025 16:28

nefrathenrici force-pushed the ne/async_workers branch from 6dba93a to ed12921 Compare March 14, 2025 16:42

AlexisRenchon mentioned this pull request Mar 14, 2025

Calibration of land models CliMA/ClimaLand.jl#1049

Closed

35 tasks

nefrathenrici force-pushed the ne/async_workers branch 2 times, most recently from 7b8d0ce to 9c69284 Compare March 14, 2025 18:35

nefrathenrici added 2 commits March 19, 2025 10:20

Add optional initial evaluation expression for workers

4e14c97

Add new workers to pool when assigning work; known issues with worker initialization order

684f06f

nefrathenrici force-pushed the ne/async_workers branch from 9c69284 to ab1b7ef Compare March 19, 2025 17:26

Invalidate Distributed.create_worker to execute custom expression on initialization

9854b9e

nefrathenrici force-pushed the ne/async_workers branch from ab1b7ef to 9854b9e Compare March 19, 2025 18:03
