Move hash computation so that it is recomputed on retry, and now-inva… #258

Merged: 1 commit merged into broadinstitute:dev on Oct 31, 2023

Conversation

alecw
Contributor

@alecw alecw commented Aug 24, 2023

…lid checkpoint is not loaded.

If the number of tries is exhausted and the ELBO tests are still failing, allow the run to complete anyway (using the checkpoint) so that outputs are produced, but exit(1).

Member

@sjfleming sjfleming left a comment

Looks good, thank you! One question about the final re-run strategy

Comment on lines +71 to +74
# The following settings do not affect the results, and can change when retrying,
# so remove them.
'epoch_elbo_fail_fraction', 'final_elbo_fail_fraction',
'num_failed_attempts', 'checkpoint_filename']
Member

Thank you! These had been overlooked by me
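A minimal sketch of the idea behind the excluded settings, assuming the hash is built from a dictionary of run settings. The helper `settings_hash`, the serialization via `json`, and the example settings `epochs` and `learning_rate` are illustrative assumptions, not the project's actual implementation; only the four excluded key names come from the diff above.

```python
import hashlib
import json

# Key names taken from the diff excerpt above: they can change between
# retries without affecting the results.
VOLATILE_KEYS = ('epoch_elbo_fail_fraction', 'final_elbo_fail_fraction',
                 'num_failed_attempts', 'checkpoint_filename')


def settings_hash(settings: dict, exclude=VOLATILE_KEYS) -> str:
    """Hypothetical helper: hash only the settings that affect the results."""
    stable = {k: v for k, v in settings.items() if k not in exclude}
    # sort_keys makes the serialization independent of dict insertion order
    payload = json.dumps(stable, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode('utf-8')).hexdigest()[:10]


# Hypothetical settings for illustration only.
first_try = {'epochs': 150, 'learning_rate': 1e-4,
             'final_elbo_fail_fraction': 0.1, 'num_failed_attempts': 0}
final_run = dict(first_try, final_elbo_fail_fraction=None, num_failed_attempts=2)
changed_lr = dict(first_try, learning_rate=5e-5)

same = settings_hash(first_try) == settings_hash(final_run)
different = settings_hash(first_try) != settings_hash(changed_lr)
```

Under these assumptions, changing only the excluded settings leaves the hash untouched, while changing a setting that affects the results produces a new hash.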

@@ -771,7 +789,12 @@ def run_inference(dataset_obj: SingleCellRNACountsDataset,
     sys.exit(0)
 else:
     logger.info(f'No more attempts are specified by --num-training-tries. '
-                f'Therefore the workflow will abort here.')
+                f'Therefore the workflow will run once more without ELBO restrictions.')
Member

Do you want to re-run it? Does this re-run "cache", i.e. use the checkpoint? Or does it actually retrain the whole thing?

Contributor Author

Yes, after the changes to the elements that are included in the hash computation, this uses the most recent checkpoint.

Member

Awesome, sounds perfect
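To illustrate why the final run can pick up the most recent checkpoint, here is a rough sketch that assumes checkpoints are located by the settings hash. The file naming scheme and the helpers `checkpoint_path` and `find_checkpoint` are hypothetical and are not taken from the project's code.

```python
import os
import tempfile


def checkpoint_path(directory: str, run_hash: str) -> str:
    """Hypothetical naming scheme: one checkpoint file per settings hash."""
    return os.path.join(directory, f'ckpt_{run_hash}.tar.gz')


def find_checkpoint(directory: str, run_hash: str):
    """Return the checkpoint path if one exists for this hash, else None."""
    path = checkpoint_path(directory, run_hash)
    return path if os.path.isfile(path) else None


with tempfile.TemporaryDirectory() as tmp:
    # Pretend an earlier attempt saved a checkpoint under hash 'abc123'.
    with open(checkpoint_path(tmp, 'abc123'), 'w') as f:
        f.write('placeholder')

    reused = find_checkpoint(tmp, 'abc123')  # same hash: checkpoint is found
    stale = find_checkpoint(tmp, 'def456')   # different hash: nothing to load
    reused_name = os.path.basename(reused) if reused else None
```

In this sketch, a run whose hash is unchanged resumes from the saved file, and a run whose hash has changed finds no checkpoint and therefore cannot load an invalid one.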

+    args.final_elbo_fail_fraction = None
+    run_remove_background(args)  # start from scratch
+    # non-zero exit status in order to draw user's attention to the fact that ELBO tests
+    # were never satisfied.
+    sys.exit(1)
Member

I can see the value of still wanting to exit 1 so that the run will be flagged as a failure
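The overall control flow discussed in this thread can be sketched as follows. This is a simplified illustration, not the code from the PR: `run_with_fallback` and `fake_train` are hypothetical names, and the function returns the exit status rather than calling `sys.exit` so that the behavior is easy to inspect.

```python
from types import SimpleNamespace


def run_with_fallback(args, train, max_tries: int) -> int:
    """Hypothetical driver: returns the exit status instead of exiting."""
    for attempt in range(max_tries):
        args.num_failed_attempts = attempt
        if train(args):  # True means the ELBO tests passed
            return 0
    # All tries are used up: drop the ELBO restriction and run once more so
    # that outputs are still produced, then report failure via the status.
    args.final_elbo_fail_fraction = None
    train(args)
    return 1


calls = []


def fake_train(args) -> bool:
    """Stand-in for training: passes only when no ELBO restriction is set."""
    calls.append(args.final_elbo_fail_fraction)
    return args.final_elbo_fail_fraction is None


args = SimpleNamespace(final_elbo_fail_fraction=0.1, num_failed_attempts=0)
status = run_with_fallback(args, fake_train, max_tries=2)
```

A caller would pass `status` to `sys.exit`, so a workflow manager still flags the run as failed even though outputs were written.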

@sjfleming
Member

This looks good to me! Thanks for contributing it.

I think you're the main person using this functionality, so if the behavior gives you what you want in your testing, I'm happy to merge this. Unfortunately I don't have any unit tests written yet for this kind of retry functionality... some day!

@alecw
Contributor Author

alecw commented Aug 26, 2023

Hi @sjfleming ,

Yes, it appears to be working fine, so please go ahead and merge.

Thanks, Alec

@sjfleming sjfleming added the bug Something isn't working label Aug 28, 2023
@sjfleming sjfleming merged commit 7a834a4 into broadinstitute:dev Oct 31, 2023
3 checks passed
sjfleming added a commit that referenced this pull request Oct 31, 2023
* Add WDL input to set number of retries. (#247)

* Move hash computation so that it is recomputed on retry, and now-invalid checkpoint is not loaded. (#258)

* Bug fix for WDL using MTX input (#246)

* Memory-efficient posterior generation (#263)

* Fix posterior and estimator integer overflow bugs on Windows (#259)

* Move from setup.py to pyproject.toml (#240)

* Fix bugs with report generation across platforms (#302)

---------

Co-authored-by: kshakir <[email protected]>
Co-authored-by: alecw <[email protected]>