Ability to create new block per task #3621

Open
drewoldag opened this issue Sep 19, 2024 · 2 comments

@drewoldag

Is your feature request related to a problem? Please describe.
The HPC system I have access to uses a "condo model" with Slurm: I can request resources that are not currently in use, but if the owner of those resources requests them while I'm still using them, I will be preempted on short notice.

We would like to be able to instruct Parsl to scale in the block that was used to run a task, and then scale out another block for the next task in line. What we have observed instead is that a block is scaled out and multiple tasks are executed sequentially on it until 1) there are no more tasks in the queue or 2) the provider's walltime is exceeded.

It seems like there must be a way to configure the system to do this, but I have been unable to find it in the documentation.

Describe the solution you'd like
Again, I would guess that this functionality already exists and I've simply missed it. Ideally there would be a parameter on either the executor or the provider that instructs Parsl to spin down the requested resources after a task completes.

Additional context
Here's a snippet of my current configuration file. Here we're expecting at most 10 blocks to be provisioned, with 1 node/block and 1 worker/node.

from parsl.executors import HighThroughputExecutor
from parsl.providers import SlurmProvider

HighThroughputExecutor(
    # This executor is used for reprojecting sharded WorkUnits
    label="sharded_reproject",
    max_workers=1,
    provider=SlurmProvider(
        partition="ckpt-g2",
        account="astro",
        min_blocks=0,
        max_blocks=10,
        init_blocks=0,
        parallelism=1,
        nodes_per_block=1,
        cores_per_node=8,
        mem_per_node=32,
        exclusive=False,
        walltime="01:00:00",
        worker_init="",
    ),
),
@benclifford
Collaborator

Hi.

I don't immediately see how your second paragraph follows from the first, so I'll structure my comments here without making that assumption.

First, I'm interested in what you're trying to achieve by having your batch jobs not last very long - they're going to get preempted anyway(?), so is there some other scheduling going on here that makes short batch jobs more desirable? Can you describe that a bit more? (eg. as an example question, what benefit do you get from 2 tasks in 2 blocks vs 2 tasks in 1 block?)

The block scaling code is really distinct from the high throughput executor task execution code: that's deliberate because the pilot jobs in the batch system really are meant to be distinct from individual tasks - so this isn't a thing Parsl features have been driving towards (and probably not a thing we're super interested in pursuing without some more detailed justification).

You could hack the process worker pool (process_worker_pool.py) to shut down after one task if you want to try out that behaviour - the Parsl submit side will notice and (eventually) scale up another pool to do more work. I think @ryanchard has had some experience doing that in the past - but actually I think what he wanted turned out to be more like workers that only last for a short time period and drain after that (without any particular concern for the task structure inside that short time period), and for that he ended up using the htex worker drain time options (see drain_period in https://parsl.readthedocs.io/en/latest/stubs/parsl.executors.HighThroughputExecutor.html)
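
To make that concrete, wiring drain_period into the configuration from your first post would look roughly like the sketch below (untested; the 300 second value is only illustrative, and the other parameters are copied from your snippet):

from parsl.executors import HighThroughputExecutor
from parsl.providers import SlurmProvider

HighThroughputExecutor(
    label="sharded_reproject",
    max_workers=1,
    # Workers stop accepting new tasks roughly drain_period seconds after the
    # block starts, finish whatever they are currently running, and then exit,
    # so the batch job ends and scaling can request a fresh block if there is
    # still work in the queue.
    drain_period=300,  # illustrative: 5 minutes
    provider=SlurmProvider(
        partition="ckpt-g2",
        account="astro",
        min_blocks=0,
        max_blocks=10,
        init_blocks=0,
        parallelism=1,
        nodes_per_block=1,
        walltime="01:00:00",
    ),
)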

@DinoBektesevic

Hello Ben,

thanks for the response.

First, I'm interested in what you're trying to achieve by having your batch jobs not last very long - they're going to get preempted anyway(?), so is there some other scheduling going on here that makes short batch jobs more desirable?

Yes, you are spot on. The issue is not the condo model; the issue is that we would like to utilize what is called a checkpoint queue. The checkpoint queue gives access to all of the resources on the cluster that are not currently being utilized. In effect, we are granted access to resources beyond what we were allocated as reserved resources. However, should the owner of the resources we were given via the checkpoint queue request them back, our jobs are terminated and the resources are re-allocated ("preempted") to them.
I believe the idea is to allow many users of the cluster access to a large amount of resources in parallel, but this comes at the cost that the jobs they run there need to a) be short, and b) be able to save their state and fail gracefully at a moment's notice.

We are able to achieve a high level of parallelism with reasonably short individual jobs (roughly ten thousand jobs on the order of 5-7 minutes), but some of our jobs take 20 minutes (thousands of them), and some take a few hours (a few thousand). The duration of a job is a function of the data that the job takes and unfortunately cannot be predicted in advance (not which data the job takes, but whether that data will take a long time to process). The time-to-preemption depends on the utilization of the cluster at any given time, so sometimes we are able to retain a worker for a few hours and sometimes for at best 10 minutes.

We only have access to a few more permanent resources, so if we were to configure an executor to use only those, they would be a throughput bottleneck for us. We believe we would benefit from utilizing the checkpoint queue to scale out and handle the majority of our jobs.

The block scaling code is really distinct from the high throughput executor task execution code: that's deliberate because the pilot jobs in the batch system really are meant to be distinct from individual tasks [...]

Separating resource acquisition and retention from execution makes sense and generally works well for us, especially in environments where usage is credit-based, for example. This is what I believe is happening to us on the resources we are currently using:

  • we get a lot of blocks on the checkpoint queue with the HighThroughputExecutor
  • they execute a task or two and then get preempted; failed tasks get scheduled for a retry
  • a task is paired with an executor, and if it's a freshly allocated resource "all is well", but when it's a pre-existing resource the task fails when the node gets preempted again, because the executor has already occupied it for a while and the owner has requested it back.

We can put a large retry number on our tasks, but then we find that often both the shorter- and longer-running tasks just repeatedly keep failing and block the other tasks in the queue. We thought initially that if we did not increase the retry counter when a preemption happens, we could keep the retry counter low, so that when a task fails, say, 2 times, we know it's just a bad task or a task not fit for the checkpoint queue. We can't seem to figure out how to avoid incrementing the retry counter in the case of a preemption, though.
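
For what it's worth, the direction we were considering was something along the lines of Parsl's retry_handler hook on the Config, charging a reduced cost when a failure looks like a preemption. Below is a rough, untested sketch: matching on the ManagerLost/WorkerLost exception names is only our guess at how a preempted block surfaces to the task, and sharded_reproject_executor is a stand-in for the executor from the snippet above.

from parsl.config import Config

def preemption_aware_retry_handler(exception, task_record):
    # Guess: a task that was running on a preempted block shows up as a lost
    # manager/worker rather than as an application-level exception.
    if type(exception).__name__ in ("ManagerLost", "WorkerLost"):
        return 0.1  # preemptions only nibble at the retry budget
    return 1.0      # ordinary failures count as a full retry

config = Config(
    executors=[sharded_reproject_executor],  # the HighThroughputExecutor defined earlier
    retries=2,  # total failure "cost" a task may accumulate before it is given up on
    retry_handler=preemption_aware_retry_handler,
)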

Ultimately, we end up with a long queue of retry jobs: some of them are long jobs that could not finish on the checkpoint queue anyhow, some will always fail because the data is bad, and some just keep getting preempted over and over again until they "luck out". And maybe that's fine, but we figured that all the tasks that were retried, only to be preempted or legitimately fail again, are just wasted resources, and they make our book-keeping and log management rather cumbersome.

We were looking for a better way forward: something that executes optimistically for the natural duration of the task itself and, if it gets preempted, is retried only once or twice. All the other failures would be considered for a new processing campaign on a different, more permanent queue, or treated as legitimate failures due to poor data.

I may have butchered a number of nomenclature-related things in this section, I apologize; hopefully what I meant is clear.

but actually I think what he wanted turned out to be more like workers that only last for a short time period and drain after that (without any particular concern for the task structure inside that short time period)

I'm trying to read into the advice in your reply, and this sounds like something we could also do? But I'm not a Parsl expert (yet), so I just want to clarify a couple of things that are confusing me:

  • Is draining an executor the same as reaching the walltime for a provider?
  • And what is the "maximum walltime of batch jobs"?

If I set the drain period to, let's say, ~5 minutes and my jobs are 7 minutes, can I then expect Parsl to schedule 1 job per executor block, because by the time the job finishes the block has already been "marked to be drained"? Would the block go away after the task has finished, and a new one be launched depending on the queue and scaling configuration? Or would the executor interrupt the task and drain at 5 minutes plus whatever overhead it takes Parsl to communicate that to the executor?
If my job takes significantly more than 7 minutes, let's say 14 minutes, would the executor force the task to stop, drain, and replace itself with a new executor block based on the max_blocks and parallelism values? Or would the block continue executing until that job completes? I think this is the same question, just the other way around.

I don't think we care about "the task structure inside that short time period"? We do have 3 consecutive tasks, it's true, but they just have to execute in order to produce the expected artifacts for subsequent tasks; each "batch" of these jobs is basically embarrassingly parallel with respect to the other tasks in the batch, so we could launch the three tasks consecutively, as separate workflows, ourselves.
