Move hash computation so that it is recomputed on retry, and now-inva… #258

alecw · 2023-08-24T20:16:49Z

…lid checkpoint is not loaded.

If the number of tries is exhausted and the ELBO tests are still failing, allow the run to complete anyway (using the checkpoint) so that outputs are produced, but exit(1).

sjfleming

Looks good, thank you! One question about the final re-run strategy

sjfleming · 2023-08-25T19:25:07Z

cellbender/remove_background/run.py

+                         # The following settings do not affect the results, and can change when retrying,
+                         # so remove them.
+                         'epoch_elbo_fail_fraction', 'final_elbo_fail_fraction',
+                         'num_failed_attempts', 'checkpoint_filename']


Thank you! I had overlooked these.
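For readers following the thread, here is a minimal sketch of the idea behind this hunk, not the actual code in `run.py`. The helper `settings_hash`, the example settings, and the use of `json` plus SHA-256 are all assumptions made for illustration; only the four excluded key names come from the diff above. The sketch also assumes that a retry changes some result-affecting setting (here a hypothetical `learning_rate`), which is why the retry gets a new hash.

```python
import hashlib
import json

# Key names taken from the diff above: they do not affect the results
# and can change between retries, so they are left out of the hash.
RETRY_ONLY_KEYS = ('epoch_elbo_fail_fraction', 'final_elbo_fail_fraction',
                   'num_failed_attempts', 'checkpoint_filename')


def settings_hash(settings: dict, exclude=RETRY_ONLY_KEYS) -> str:
    """Hypothetical helper: hash only the settings that influence the results."""
    relevant = {k: v for k, v in settings.items() if k not in exclude}
    # sort_keys makes the serialization independent of dict insertion order
    payload = json.dumps(relevant, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode('utf-8')).hexdigest()


# Illustrative settings only; the real argument set is much larger.
first_try = {'epochs': 150, 'learning_rate': 1e-4,
             'final_elbo_fail_fraction': 0.1, 'num_failed_attempts': 0}
# Assumed: a retry changes a result-affecting setting, so its hash differs
# and the checkpoint from the earlier attempt would not be matched.
retry = dict(first_try, learning_rate=5e-5, num_failed_attempts=1)
# Lifting only the ELBO restriction leaves the hash unchanged,
# so the most recent checkpoint would still match.
final_run = dict(retry, final_elbo_fail_fraction=None)
```

Under these assumptions, `settings_hash(first_try)` differs from `settings_hash(retry)`, while `settings_hash(retry)` equals `settings_hash(final_run)`.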

sjfleming · 2023-08-25T19:26:46Z

cellbender/remove_background/run.py

@@ -771,7 +789,12 @@ def run_inference(dataset_obj: SingleCellRNACountsDataset,
                sys.exit(0)
            else:
                logger.info(f'No more attempts are specified by --num-training-tries. '
-                            f'Therefore the workflow will abort here.')
+                            f'Therefore the workflow will run once more without ELBO restrictions.')


Do you want to re-run it? Does this re-run "cache", i.e. use the checkpoint? Or does it actually retrain the whole thing?

Yes, after the changes to the elements that are included in the hash computation, this uses the most recent checkpoint.

Awesome, sounds perfect

sjfleming · 2023-08-25T19:27:10Z

cellbender/remove_background/run.py

+                args.final_elbo_fail_fraction = None
+                run_remove_background(args)  # start from scratch
+                # non-zero exit status in order to draw user's attention to the fact that ELBO tests
+                # were never satisfied.
                sys.exit(1)


I can see the value of still wanting to exit 1 so that the run will be flagged as a failure
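The control flow discussed in this thread can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `run.py`: `run_with_retries` and the `train` callable are hypothetical names, and the function returns the exit status rather than calling `sys.exit` so that it is easy to exercise in isolation.

```python
from typing import Callable


def run_with_retries(train: Callable[[dict], bool],
                     settings: dict,
                     num_tries: int) -> int:
    """Hypothetical driver: returns the intended process exit status.

    `train` is assumed to return True when the ELBO tests pass.
    """
    for attempt in range(num_tries):
        settings['num_failed_attempts'] = attempt
        if train(settings):
            return 0  # ELBO tests satisfied: normal success
    # Tries exhausted: run once more with the final ELBO restriction lifted
    # so that outputs are still produced, then report failure to the caller.
    settings['final_elbo_fail_fraction'] = None
    train(settings)
    return 1
```

A caller would then end with something like `sys.exit(run_with_retries(train, settings, num_tries))`, so that the outputs exist but the run is still flagged as a failure.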

sjfleming · 2023-08-25T20:54:04Z

This looks good to me! Thanks for contributing it.

I think you're the main person using this functionality, so if the behavior gives you what you want in your testing, I'm happy to merge this. Unfortunately I don't have any unit tests written yet for this kind of retry functionality... some day!

alecw · 2023-08-26T00:18:31Z

Hi @sjfleming,

Yes, it appears to be working fine, so please go ahead and merge.

Thanks, Alec

* Add WDL input to set number of retries. (#247)
* Move hash computation so that it is recomputed on retry, and now-invalid checkpoint is not loaded. (#258)
* Bug fix for WDL using MTX input (#246)
* Memory-efficient posterior generation (#263)
* Fix posterior and estimator integer overflow bugs on Windows (#259)
* Move from setup.py to pyproject.toml (#240)
* Fix bugs with report generation across platforms (#302)

---------

Co-authored-by: kshakir <[email protected]>
Co-authored-by: alecw <[email protected]>

Move hash computation so that it is recomputed on retry, and now-inva…

3c8cb38

…lid checkpoint is not loaded. If number of tries is exhausted, and ELBO tests are still failing, allow to complete anyway (using checkpoint) so that outputs are produced, but exit(1).

alecw force-pushed the aw_retry_hashcode branch from 95baec7 to 3c8cb38 Compare August 24, 2023 20:18

sjfleming reviewed Aug 25, 2023

sjfleming approved these changes Aug 25, 2023

sjfleming added the bug Something isn't working label Aug 28, 2023

mbabadi approved these changes Oct 24, 2023

sjfleming merged commit 7a834a4 into broadinstitute:dev Oct 31, 2023
3 checks passed
