Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parameter sweep run_model is hard to debug #1356

Open
Robbybp opened this issue Feb 27, 2024 · 6 comments
Open

Parameter sweep run_model is hard to debug #1356

Robbybp opened this issue Feb 27, 2024 · 6 comments
Assignees
Labels
discussion Discussion Priority:Normal Normal Priority Issue or PR

Comments

@Robbybp
Copy link
Member

Robbybp commented Feb 27, 2024

In ParameterSweepBase, we wrap run_model in a try/except with no specified exception. This seems a bit dangerous. For errors in solvers and other "black-box" subroutines, we probably want to record them. For other errors, e.g. typos, we probably want the sweep to fail immediately. Otherwise we can run an entire parameter sweep just to see something like ValueError: too many values to unpack for every item. The risk of odd errors increases as we write more complicated run_model methods. Right now I have scaling, re-initialization, solve, and validation steps in a run_model function that I'm using. My question is: Should we define an exception type that users must raise in order to be caught by the sweep runner? The downside is that users will have to know what types of exceptions to expect from their black-box subroutines, and will have to catch these and re-raise our exception type. I assume the current design was chosen intentionally, but am curious to hear the rationale for it.

@andrewlee94
Copy link
Member

@Robbybp I don't know that there was too much rationale behind the current design; it mostly echoes designs from previous iterations but I am not sure that there has been much thought into different ways it can go wrong. Having a terminate-immediately Exception type sounds like a good idea however; usage would primarily fall upon the user but that is probably how it has to be (i.e. it is up to them how much they want to instrument their method).

That being said, we probably should look at catching some of the obvious ones ourselves if possible. The ValueError example above strikes me as one we might be able to handle.

To consider other alternatives, there is the opposite approach in that we fail outright on any Exception and insist that users put in try/excepts as necessary (which they would need to do in the above anyway). However, that could be just as frustrating if you have a long run and late in the cycle you run into the one Exception you did not expect.

So, in short I think trying to find something in the middle might be he best approach. I.e. have an Exception users can use to indicate fail immediately (that they can use as they wish), add some more protections where we can to catch obvious issues like callback signatures with the wrong number of arguments, and otherwise just catch the Exception and log it in the results.

Adding the terminate immediately Exception is easy, but we should think about what cases we have where we want to fail immeidately (or better yet, fail before executing anything).

@andrewlee94
Copy link
Member

On re-reading the original message, we should also log any Exceptions we catch in the results for posterity.

@Robbybp
Copy link
Member Author

Robbybp commented Feb 27, 2024

To consider other alternatives, there is the opposite approach in that we fail outright on any Exception and insist that users put in try/excepts as necessary (which they would need to do in the above anyway). However, that could be just as frustrating if you have a long run and late in the cycle you run into the one Exception you did not expect.

This would have been my suggestion, but I see that it could be frustrating to get an unhandled exception near the end of a run. Maybe we could inspect the exception in our except block and re-raise if it is TypeError, ValueError, KeyError, IndexError, AttributeError, or any other error that we do not expect a solver to raise. I think this is basically what you're suggesting?

@Robbybp
Copy link
Member Author

Robbybp commented Feb 27, 2024

On re-reading the original message, we should also log any Exceptions we catch in the results for posterity.

We do this already, right?

@andrewlee94
Copy link
Member

@Robbybp My question then is what exceptions would we expect a solver to raise, or perhaps more specifically can we neatly categorize things into exceptions to catch and log versus those to re-raise? My initial feeling is that it would be better to have a "white-list" of exceptions to catch and log rather than a black-list of exceptions to terminate on - it will fail more aggressively, but would generally indicate a user-error that should be addressed. We could then clearly document the exception types that will be logged for the user.

Another thing however would be to think about how this might work in a parallel environment where we would ideally want a critical failure in one thread to kill all other threads as well (I think this is probably part of the reason for the liberal catch and log approach at the moment).

@Robbybp
Copy link
Member Author

Robbybp commented Feb 28, 2024

My initial feeling is that it would be better to have a "white-list" of exceptions to catch and log rather than a black-list of exceptions to terminate on

I think in this case I would advocate for a ParameterSweepRuntimeError or something that we require users raise if they want the exception to be caught. This way the user is in charge of knowing what exceptions the solver could raise. I'm fine with this approach, especially as I think this is the easiest thing for us to implement and maintain. I think re-raising specific non-runtime/arithmetic errors is more user-friendly (they don't have to implement their own try/except), but less safe (what if a solver raises ValueError internally). I think overall I prefer the former approach as well.

Another thing however would be to think about how this might work in a parallel environment where we would ideally want a critical failure in one thread to kill all other threads as well

I think this is out-of-scope for our current implementation. The parallel implementation can override the default, although I think I see the advantages of the current implementation in the parallel setting.

@ksbeattie ksbeattie added discussion Discussion Priority:Normal Normal Priority Issue or PR labels Feb 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion Discussion Priority:Normal Normal Priority Issue or PR
Projects
None yet
Development

No branches or pull requests

3 participants