Azurize CellBender to run on ToA #367

Merged: 5 commits, Jun 24, 2024

Contributor

@aawdeh aawdeh commented Jun 13, 2024

We don't have GPU access in Terra on Azure (ToA), and we want to run Multiome (with CellBender) on Azure.

cellbender_remove_background_azure.wdl is a copy of cellbender_remove_background.wdl without the CUDA options for CellBender.

Azurized CellBender so that it runs in Multiome by (1) removing the --cuda option as a CellBender input parameter and (2) removing runtime variables for GPU access when on Azure.
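
The two changes described above can be sketched in Python. This is only an illustration of the idea, not the contents of the WDL file: the helper names `cellbender_command` and `runtime_attributes`, the `use_gpu` switch, and the example GPU type are assumptions made for this sketch. `--cuda` is the CellBender flag named in this PR, and `gpuType`/`gpuCount` stand in for the GPU runtime variables.

```python
from typing import Dict, List, Union


def cellbender_command(input_path: str, output_path: str, use_gpu: bool) -> List[str]:
    """Build a `cellbender remove-background` argument list (hypothetical helper)."""
    cmd = [
        "cellbender", "remove-background",
        "--input", input_path,
        "--output", output_path,
    ]
    # Change (1): only pass --cuda when a GPU is actually available.
    if use_gpu:
        cmd.append("--cuda")
    return cmd


def runtime_attributes(
    use_gpu: bool,
    gpu_type: str = "nvidia-tesla-t4",  # illustrative value, not taken from the PR
) -> Dict[str, Union[str, int]]:
    """Return GPU-related runtime attributes (hypothetical helper)."""
    attrs: Dict[str, Union[str, int]] = {}
    # Change (2): emit GPU runtime variables only when a GPU is requested.
    if use_gpu:
        attrs["gpuType"] = gpu_type
        attrs["gpuCount"] = 1
    return attrs


# CPU-only variant, as on Azure: no --cuda and no GPU runtime keys.
azure_cmd = cellbender_command("raw.h5", "out.h5", use_gpu=False)
azure_runtime = runtime_attributes(use_gpu=False)
```

In the PR itself the same effect is achieved with a separate WDL file instead of a switch, which keeps the existing GPU workflow untouched.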

@aawdeh aawdeh changed the title Azurize Cell Bender to run on ToA Azurize CellBender to run on ToA Jun 13, 2024
@sjfleming
Member

Looks good to me! Thanks.

@sjfleming
Member

One other thing to mention is that, per Janet Gainer-Dewar, the use of preemptible machines on Azure might not make sense currently. It sounds like there is no support for "checkpointFile" on Azure yet.

So if you run with hardware_preemptible_tries > 0, then you risk never actually finishing. CellBender will not pick up from its last checkpoint (since Terra on Azure doesn't support "checkpointFile"), and Janet said that they do not retry on a non-preemptible machine for the last run. So it might just fail if it continues to get preempted. I just worry about it a little because the run is bound to take a long time on CPU.
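
A rough way to reason about this risk is sketched below. It assumes that preemptions arrive independently at a constant rate (a Poisson model, which is an assumption of this sketch and not something stated in this thread) and that every attempt restarts from zero because there is no checkpoint. The function names, the rate, and the number of attempts are hypothetical.

```python
import math


def prob_attempt_preempted(runtime_hours: float, preemptions_per_hour: float) -> float:
    """Probability that one attempt is preempted before it finishes.

    Assumes a constant preemption rate, so the chance of surviving the
    whole run is exp(-rate * runtime).
    """
    return 1.0 - math.exp(-preemptions_per_hour * runtime_hours)


def prob_never_finishes(runtime_hours: float, preemptions_per_hour: float, attempts: int) -> float:
    """Probability that every attempt is preempted.

    Without a checkpoint each attempt needs the full runtime, and this
    model has no final non-preemptible attempt to fall back on.
    """
    return prob_attempt_preempted(runtime_hours, preemptions_per_hour) ** attempts


# Illustrative comparison only; the rate and runtimes are made-up inputs.
short_run = prob_never_finishes(runtime_hours=2.0, preemptions_per_hour=0.05, attempts=3)
long_run = prob_never_finishes(runtime_hours=24.0, preemptions_per_hour=0.05, attempts=3)
```

Under this model the failure probability grows with the runtime of a single attempt, which is why a long CPU-only run is the worrying case.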

@sjfleming sjfleming changed the base branch from master to dev June 21, 2024 18:39
@sjfleming sjfleming changed the base branch from dev to master June 21, 2024 18:41
@sjfleming sjfleming changed the base branch from master to dev June 21, 2024 18:43
@aawdeh
Contributor Author

aawdeh commented Jun 21, 2024

> One other thing to mention is that, per Janet Gainer-Dewar, the use of preemptible machines on Azure might not make sense currently. It sounds like there is no support for "checkpointFile" on Azure yet.
>
> So if you run with hardware_preemptible_tries > 0, then you risk never actually finishing. CellBender will not pick up from its last checkpoint (since Terra on Azure doesn't support "checkpointFile"), and Janet said that they do not retry on a non-preemptible machine for the last run. So it might just fail if it continues to get preempted. I just worry about it a little because the run is bound to take a long time on CPU.

That makes sense. I could set hardware_preemptible_tries to 0 in the Azure run. What do you think? I can go ahead and make that change.

I was also wondering whether it would make sense to remove maxRetries from the runtime variables?

@sjfleming
Member

Setting hardware_preemptible_tries to 0 by default sounds like a good idea, to be conservative. Yeah, maxRetries > 0 is usually used on GCP to overcome PAPI error code 2, which happens when it fails (inexplicably) to install GPU drivers. Retrying makes sense in that one limited case, since a retry can succeed. But most of the time a retry does not make sense: if it was a CellBender failure, it will fail again on retry. But I guess it makes sense to leave that argument with a 0 default.
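
The distinction between the two kinds of failure can be illustrated with a toy model. This is a sketch under stated assumptions, not how the backend implements retries: `run_with_retries`, `transient_failure`, and `deterministic_failure` are hypothetical names, and the model simply allows one initial attempt plus `max_retries` further attempts.

```python
from typing import Callable


def run_with_retries(task: Callable[[int], bool], max_retries: int) -> bool:
    """Run task(attempt) up to 1 + max_retries times; True on first success."""
    for attempt in range(1 + max_retries):
        if task(attempt):
            return True
    return False


def transient_failure(attempt: int) -> bool:
    """Stand-in for a one-off setup failure: only the first attempt fails."""
    return attempt >= 1


def deterministic_failure(attempt: int) -> bool:
    """Stand-in for a failure in the tool itself: every attempt fails."""
    return False


# A retry rescues the transient case but never the deterministic one.
transient_ok = run_with_retries(transient_failure, max_retries=1)
deterministic_ok = run_with_retries(deterministic_failure, max_retries=5)
```

With a default of 0 nothing is retried, and a user who does hit a transient failure can still raise the value for that run.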

Member

@sjfleming sjfleming left a comment


Looks good, I will merge

@sjfleming sjfleming merged commit 2c3dbe8 into broadinstitute:dev Jun 24, 2024
4 checks passed
@aawdeh
Contributor Author

aawdeh commented Jun 24, 2024

> Looks good, I will merge

Thank you!

2 participants