[DONE] Should we reconsider our error types? #460

olivereanderson · 2021-10-27T21:43:22Z

olivereanderson
Oct 27, 2021

Should we reconsider our error approach:

Introduction

Users with little experience with Rust may want to first read the background section.

Currently each of our crates introduces a single error enum with (often) many different variants that cover every possible way any single method exposed in the crate can fail. This makes it very easy to compose functions and return early with the try operator ?, but it also places a large cognitive burden on our users, as they have to work out which errors can actually occur and how to handle them. When there are so many variants to consider, it is tempting for library users to match on the variants they understand and/or expect and use a wildcard for the rest. The creator of the sled crate has stated that such things do indeed happen, and that they can cause bugs over time. Here is a quote from the following blog post (thanks @PhilippGackstatter for pointing this one out):

Dozens and dozens of bugs happened over years of development where the underlying issue boiled down to either accidentally using the try ? operator somewhere that a local error should have been handled, or by performing a partial pattern match that included an over-optimistic wildcard match.

Furthermore, it is also hard for us to maintain these huge error enums. The fact that https://github.com/iotaledger/identity.rs/blob/dev/identity-account/src/error.rs#L14 and https://github.com/iotaledger/identity.rs/blob/dev/identity-account/src/error.rs#L95 are both variants of the same enum, and the former has an InvalidPrivateKey variant of its own, suggests that we are already struggling. Finally, having to think about the possibility of loads of values that may or may not exist does not feel right in a statically typed language like Rust.

These are the reasons we should reconsider how we go about error handling in this project. The purpose of this discussion is to try to establish guidelines for how our library should present errors.

Some background on how errors are represented in Rust

This section is mainly meant to give developers who mostly use the JavaScript bindings (whose input we definitely also value) a better understanding of what is being discussed. Unfortunately we cannot explain every aspect of error handling in Rust and must settle for a brief overview.

Representing errors in Rust

The book Rust for Rustaceans by Jon Gjengset states that there are two main options for representing errors: Enumeration or erasure.

An example of the former:
We have a service which users can log in to. There are several different ways in which this operation can fail for a given user so we enumerate them:

enum LoginError {
    UserNotFound(String), // Keep the username so it can be displayed 
    IncorrectPassword,
    DatabaseConnectionLost,
}

And the corresponding function would look something like

fn login(username: &str, password: &str) -> Result<(), LoginError>

Whenever this function fails we can find out exactly how and act accordingly:
If the database connection is lost we may wish to re-establish it and retry, if the password is typed wrong 10 times we may wish to deactivate the account for the time being, and if the username is not found we tell the user that we could not find any user with the username they provided.
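As a rough sketch of what "act accordingly" could look like at the call site, the following matches exhaustively on LoginError. The Action enum, the hard-coded login body and the handle_login helper are assumptions made up for this illustration and are not part of any real service.

```rust
#[derive(Debug)]
enum LoginError {
    UserNotFound(String), // Keep the username so it can be displayed
    IncorrectPassword,
    DatabaseConnectionLost,
}

/// What the caller decides to do next (hypothetical, for illustration only).
#[derive(Debug, PartialEq)]
enum Action {
    Proceed,
    ShowMessage(String),
    CountFailedAttempt,
    ReconnectAndRetry,
}

/// Stand-in for a real login routine: hard-coded outcomes instead of a database.
fn login(username: &str, password: &str) -> Result<(), LoginError> {
    match (username, password) {
        ("offline", _) => Err(LoginError::DatabaseConnectionLost),
        ("alice", "hunter2") => Ok(()),
        ("alice", _) => Err(LoginError::IncorrectPassword),
        (other, _) => Err(LoginError::UserNotFound(other.to_owned())),
    }
}

/// Exhaustive match: adding a variant to `LoginError` makes this function fail
/// to compile until the new case is handled.
fn handle_login(username: &str, password: &str) -> Action {
    match login(username, password) {
        Ok(()) => Action::Proceed,
        Err(LoginError::UserNotFound(name)) => {
            Action::ShowMessage(format!("no user named {}", name))
        }
        Err(LoginError::IncorrectPassword) => Action::CountFailedAttempt,
        Err(LoginError::DatabaseConnectionLost) => Action::ReconnectAndRetry,
    }
}
```

Because there is no wildcard arm, the compiler keeps this handler in sync with the enum, which is the main benefit of a small, focused error enum.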

Suppose now instead that we are making a library for compressing and decompressing arbitrary bytes. Note that callers of the (de)compression function are not likely to understand or care exactly how our function failed, hence it is enough for them to just know that an error did occur so they can log it if they wish to do so. Thus in this scenario we may provide an opaque error type like CompressionError or even be more vague and just state that our function returns something resembling an error (an Error trait object, which is a bit like an interface in languages with a stronger emphasis on OOP).
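A minimal sketch of the erasure option, assuming a toy run-length encoder in place of a real compression algorithm. CompressionError is the opaque type mentioned above; compress, compress_erased and the "empty input" failure are invented here only so that there is something to fail.

```rust
use std::error::Error;
use std::fmt;

/// Opaque error: callers can print or log it, but cannot inspect why it failed.
#[derive(Debug)]
pub struct CompressionError {
    message: String, // private on purpose
}

impl fmt::Display for CompressionError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "compression failed: {}", self.message)
    }
}

impl Error for CompressionError {}

/// Toy run-length encoder used as a stand-in for a real compressor.
/// It refuses empty input purely so that there is a failure case to show.
pub fn compress(input: &[u8]) -> Result<Vec<u8>, CompressionError> {
    if input.is_empty() {
        return Err(CompressionError { message: "empty input".to_owned() });
    }
    let mut out = Vec::new();
    let mut iter = input.iter().copied();
    let mut current = iter.next().expect("input is non-empty");
    let mut count: u8 = 1;
    for byte in iter {
        if byte == current && count < u8::MAX {
            count += 1;
        } else {
            // Emit the finished run as (length, byte) and start a new one.
            out.push(count);
            out.push(current);
            current = byte;
            count = 1;
        }
    }
    out.push(count);
    out.push(current);
    Ok(out)
}

/// Full erasure: the signature only promises "something that implements Error".
pub fn compress_erased(input: &[u8]) -> Result<Vec<u8>, Box<dyn Error>> {
    Ok(compress(input)?) // `?` boxes the CompressionError via `From`
}
```

In both variants the caller can still log the error through Display, but has nothing to match on.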

The Try operator

Suppose we have a function that loads a config file.
This operation can fail in several ways, for example because the config file was not found or because it was corrupted, so the function signature would look something like this.

    fn load_config() -> Result<Config, LoadingError>

We also want a function, restore_default_settings, that can restore certain settings back to their defaults. Suppose that the default settings can be found in the config file; then restore_default_settings would typically call load_config internally to get the config. For this reason we expect restore_default_settings to fail if the call to load_config fails, and it would actually be super convenient if there was no other way for this operation to fail, as then we could implement it in the following way:

    fn restore_default_settings() -> Result<(), LoadingError> {
        let config = load_config()?;
        // do something with the config
        Ok(())
    }

Here the ? operator means: if load_config failed, return that error now and don't proceed any further. If restore_default_settings could fail in other ways in addition to the config not being able to load, then we would have to do some more programming in order to convert the LoadingError to the appropriate one. This is why it is so tempting to have a single error type, because then it becomes super easy to compose functions. The problem with this approach is that we don't help the function caller much in handling distinct errors. Even with the enumeration approach, the enumeration can get so long that it becomes very hard for users to understand how a given function can actually fail, as it is unlikely that all the functions in our library can fail in exactly the same way. In the opinion of this discussion's creator, this is a problem we now need to tackle in this project.
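To make "some more programming" concrete, here is a sketch where restore_default_settings has a second failure mode. RestoreError, its MissingDefaults variant, the Config layout and the raw: &str parameter (used instead of reading a real file) are assumptions for this example only.

```rust
#[derive(Debug, PartialEq)]
pub enum LoadingError {
    NotFound,
    Corrupted,
}

#[derive(Debug, PartialEq)]
pub enum RestoreError {
    /// The config could not be loaded; keeps the original cause.
    Loading(LoadingError),
    /// The config loaded but contains no defaults.
    MissingDefaults,
}

// This impl is the extra programming: it lets `?` convert automatically.
impl From<LoadingError> for RestoreError {
    fn from(err: LoadingError) -> Self {
        RestoreError::Loading(err)
    }
}

pub struct Config {
    defaults: Option<String>,
}

/// Stand-in loader: interprets a string instead of reading a file.
fn load_config(raw: &str) -> Result<Config, LoadingError> {
    match raw {
        "" => Err(LoadingError::NotFound),
        "garbage" => Err(LoadingError::Corrupted),
        "no-defaults" => Ok(Config { defaults: None }),
        other => Ok(Config { defaults: Some(other.to_owned()) }),
    }
}

fn restore_default_settings(raw: &str) -> Result<String, RestoreError> {
    let config = load_config(raw)?; // LoadingError -> RestoreError via `From`
    config.defaults.ok_or(RestoreError::MissingDefaults)
}
```

The cost is one From impl per source error, which is exactly the boilerplate a single crate-wide enum avoids.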

More advanced ways of representing errors in Rust

In the previous section we said that enumeration and erasure are the two main options we have for representing errors in Rust, but there is nothing stopping us from combining the two. Here is an example (again, thanks @PhilippGackstatter):

pub struct StorageError {
  inner: Box<dyn std::error::Error>,
  kind: StorageErrorKind
}

pub enum StorageErrorKind {
  NotFound,
  WriteError,
  ReadError,
  ConnectionLost,
  Timeout,
}

The StorageErrorKind gives a conceptual explanation for what went wrong, if a Timeout or ConnectionLost variant is discovered a retry might be suitable, but if a NotFound or WriteError occurs this might be so serious that the application might want to shut itself down and notify someone (by email) to fix their storage device. For debugging purposes the StorageError also keeps the original error (represented as a trait object on the heap) in the inner field that can either be printed, or one could attempt to cast it down to a more specific error, if more information is needed.
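A sketch of how such a type might be used, assuming the constructor and the accessor methods below (new, kind, is_retryable, io_cause), none of which appear in the original example. It shows both the coarse decision based on the kind and the attempt to cast the inner error down to something more specific.

```rust
use std::error::Error;
use std::fmt;
use std::io;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StorageErrorKind {
    NotFound,
    WriteError,
    ReadError,
    ConnectionLost,
    Timeout,
}

#[derive(Debug)]
pub struct StorageError {
    inner: Box<dyn Error>,
    kind: StorageErrorKind,
}

impl StorageError {
    pub fn new(kind: StorageErrorKind, inner: impl Into<Box<dyn Error>>) -> Self {
        Self { inner: inner.into(), kind }
    }

    pub fn kind(&self) -> StorageErrorKind {
        self.kind
    }

    /// Whether a retry is reasonable, following the reasoning in the text.
    pub fn is_retryable(&self) -> bool {
        matches!(self.kind, StorageErrorKind::ConnectionLost | StorageErrorKind::Timeout)
    }

    /// Attempts to cast the erased cause down to a concrete `std::io::Error`.
    pub fn io_cause(&self) -> Option<&io::Error> {
        self.inner.downcast_ref::<io::Error>()
    }
}

impl fmt::Display for StorageError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "storage error ({:?}): {}", self.kind, self.inner)
    }
}

impl Error for StorageError {
    // Expose the original error so generic error reporters can walk the chain.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&*self.inner)
    }
}
```

Most callers would only look at kind() or is_retryable(); the downcast is there for debugging and for the rare caller that needs more detail.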

Another interesting way of representing errors can be found in this PR from the stronghold team where the implementation is roughly

#[derive(DeriveError, Debug, Clone)]
pub enum Error<E: Debug + Display = Infallible> {
    #[error("Error sending message to Actor: `{0}`")]
    ActorMailbox(#[from] MailboxError),

    #[error("Target Actor has not been spawned or was killed.")]
    ActorNotSpawned,

    #[error("`{0}`")]
    Inner(E),
}

And then the methods return things like Result<(), Error<WriteSnapshotError>> and Result<(), Error<ReadSnapshotError>>. In short (from what we understand) one gets method specific errors for methods that can fail in ways shared with others without having to duplicate the code for those common cases.
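The following is a simplified sketch of that pattern, not the Stronghold code: the derive macro is replaced by a hand-written Display impl, the mailbox variant is left out, and WriteSnapshotError, ReadSnapshotError and the three functions are placeholders with made-up variants and parameters.

```rust
use std::convert::Infallible;
use std::fmt::{self, Debug, Display};

/// Failure modes shared by every method, plus a slot for method-specific ones.
#[derive(Debug, Clone, PartialEq)]
pub enum Error<E: Debug + Display = Infallible> {
    ActorNotSpawned,
    Inner(E),
}

impl<E: Debug + Display> Display for Error<E> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Error::ActorNotSpawned => {
                f.write_str("target actor has not been spawned or was killed")
            }
            Error::Inner(inner) => write!(f, "{}", inner),
        }
    }
}

#[derive(Debug, Clone, PartialEq)]
pub enum WriteSnapshotError {
    DiskFull,
}

impl Display for WriteSnapshotError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str("disk is full")
    }
}

#[derive(Debug, Clone, PartialEq)]
pub enum ReadSnapshotError {
    Corrupted,
}

impl Display for ReadSnapshotError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str("snapshot is corrupted")
    }
}

pub fn write_snapshot(actor_alive: bool, disk_full: bool) -> Result<(), Error<WriteSnapshotError>> {
    if !actor_alive {
        return Err(Error::ActorNotSpawned);
    }
    if disk_full {
        return Err(Error::Inner(WriteSnapshotError::DiskFull));
    }
    Ok(())
}

pub fn read_snapshot(actor_alive: bool, corrupted: bool) -> Result<(), Error<ReadSnapshotError>> {
    if !actor_alive {
        return Err(Error::ActorNotSpawned);
    }
    if corrupted {
        return Err(Error::Inner(ReadSnapshotError::Corrupted));
    }
    Ok(())
}

/// With the default parameter (`Infallible`) only the shared variants can occur.
pub fn ping(actor_alive: bool) -> Result<(), Error> {
    if actor_alive { Ok(()) } else { Err(Error::ActorNotSpawned) }
}
```

Each signature now documents exactly which method-specific failures are possible, while the shared variants are written once.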

What are the best practices for error handling in Rust as of today?

The following quote is from the book Rust for Rustaceans :

It’s worth noting that best practices for error handling in Rust are still
an active topic of conversation, and at the time of writing, the ecosystem
has not yet settled on a single, unified approach.

Although this is true, it might be a good idea to look at the patterns from stdlib pointed out by Steve Klabnik in this discussion:

1. No single error type for the whole stdlib

2. Some modules have a single error type, like std::io::Error. This is convenient, but also leads to some functions not being able to return all errors, which is not the best

3. Some functions have an error type for them only, like https://doc.rust-lang.org/stable/std/env/enum.VarError.html

Note also that although the io ErrorKind has 40 variants, it is indeed an error for a module and not an entire crate and Withoutboats mentions in the aforementioned discussion that io::Error is perhaps really a very unique case.

In the future error handling might become easier in Rust. There are several RFCs and discussions where people wish for anonymous sum types and they point out more ergonomic granular error handling as an important use case. Whether anonymous sum types eventually will be added to Rust is an open question, but the powerset enum crate can already be very helpful in this regard. Unfortunately that crate is only available on nightly Rust at the moment.

One possible Suggestion for error guidelines in identity.rs

Here is a suggestion for some guidelines we could follow when it comes to how we represent errors/failure in our project:

We stop having monster enum error types, but we could consider having some global opaque error types to use for failures that we don't expect the caller to handle in detail.
For functions in the same module that are expected to be composed, we can use a common error enum defined in that module.
High-level functions that implement a known protocol capable of failing in different ways, some of which the caller would want to handle differently, should have their own dedicated error enum types parsable by humans.

Everyone is encouraged to keep the following quote from Yuan et al. in mind throughout this discussion:

almost all (92%) of the catastrophic system failures
are the result of incorrect handling of non-fatal errors
explicitly signaled in software.

elenaf9 · 2021-10-28T09:48:34Z

elenaf9
Oct 28, 2021

Note: I am not really familiar with the identity.rs codebase, so please take everything with a grain of salt.

Great comment @olivereanderson!
I'd like to add some additional suggestions for handling errors from the underlying libraries that you consume, e.g. stronghold.rs.
I think the convenience of single error enum types makes it very tempting to just blindly wrap the error without handling it. Especially when you yourself are a library, it is the "cheapest" option to not bother with the type of error and pass the responsibility to your user/whoever ends up writing the final application. But if every library is doing it for the libraries it consumes, you end up with a monster hierarchy of error enums, where it is simply impossible for the end-user to decide how to handle each error.
So in addition to what you already wrote above about how you would implement your own error types, I'd like to suggest the following for handling errors from libraries that you consume: Handle as many errors as possible yourself and reduce the ones you bubble up to the user.

1. Dismiss error variants where you know for sure that they can never happen. E.g. if with Stronghold you never call kill_stronghold on the current actor target, or if you do you immediately switch to a new client-target, you can safely dismiss the case of Error::ActorNotSpawned. Since you are directly integrating stronghold, you have much more insight into which errors can actually happen and which can't, whereas for your user this is very hard to know unless they read through your whole codebase.
2. Abstract errors that are too detailed for the user to even know what the issue is. In some cases this may mean trimming a whole low-level error down to a String with a debug message; in other cases you may only want to join certain variants into one and leave others, depending on whether they are relevant for the user or not.
3. Make decisions and react to all errors where you have the chance to take action. E.g. if in Stronghold an operation failed because the client actor has shut down or is not responding, you may want to decide to just restart stronghold and load the most recent snapshot back into the system. That's a decision that you have to make, since for your user the whole logic around stronghold is hidden in the identity-account, and they may not even have the necessary means to simply restart stronghold, but instead would be forced to shut down the whole identity account.

This implies of course that you add your own error types at least for the larger libraries that you consume. But I think that since you are the one that is writing the logic to integrate the underlying library, you are in a much better position to classify the errors than your user is.
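The three suggestions could be sketched roughly as follows. Everything here is hypothetical: VaultError and Vault stand in for a consumed library and are NOT Stronghold's real API, and StoreError is an invented wrapper-level error type.

```rust
/// Hypothetical error type of a consumed library (not Stronghold's real API).
#[derive(Debug)]
pub enum VaultError {
    ActorNotSpawned,
    ActorUnresponsive,
    Io(String),
    Serialization(String),
}

/// The much smaller error surface we expose to our own users.
#[derive(Debug, PartialEq)]
pub enum StoreError {
    /// Low-level details abstracted into a debug message.
    Storage(String),
}

/// Stand-in for the consumed library's client.
pub struct Vault {
    responsive: bool,
    restarts: u32,
}

impl Vault {
    fn write(&mut self, key: &str) -> Result<(), VaultError> {
        if !self.responsive {
            return Err(VaultError::ActorUnresponsive);
        }
        if key.is_empty() {
            return Err(VaultError::Serialization("empty key".to_owned()));
        }
        Ok(())
    }

    fn restart(&mut self) {
        self.responsive = true;
        self.restarts += 1;
    }
}

pub fn store_key(vault: &mut Vault, key: &str) -> Result<(), StoreError> {
    let mut result = vault.write(key);
    // Make the decision ourselves where we can: restart once and retry.
    if matches!(result, Err(VaultError::ActorUnresponsive)) {
        vault.restart();
        result = vault.write(key);
    }
    result.map_err(|err| match err {
        // Dismissed: this wrapper never kills the actor, so reaching this is a bug.
        VaultError::ActorNotSpawned => unreachable!("actor is never killed by this wrapper"),
        // Abstracted: the user only learns that storage failed, plus a debug message.
        VaultError::Io(msg) | VaultError::Serialization(msg) => StoreError::Storage(msg),
        VaultError::ActorUnresponsive => {
            StoreError::Storage("storage did not respond after a restart".to_owned())
        }
    })
}
```

Four library variants shrink to a single variant in the public API, and the user never needs to know that a restart took place.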

1 reply

olivereanderson Oct 28, 2021
Author

@elenaf9 Those are very good additions! Moreover since this PR iotaledger/stronghold.rs#269 makes the ways in which stronghold can fail more explicit, it will be much easier for us to implement the points (1-3) that you stated above :)

PhilippGackstatter · 2021-10-29T10:26:48Z

PhilippGackstatter
Oct 29, 2021

Thanks for the nice write-up @olivereanderson!

I'm focusing on the account error enum, and even there just the non-wrapped variants to keep a reasonable scope. This also repeats parts of what @elenaf9 already said.

Fatal vs. Local

What is fatal and local is not always very obvious (aka error handling is hard). Taking a closer look at our account error enum, it seems that we avoided making that distinction in the past and simply enumerated all possible errors, be they fatal or local. If we want to trim that down, we need to make that distinction and so we need a way to distinguish between the two.

A fatal error indicates that an invariant has been broken, it is unrecoverable. These are invariants that we define in the components we write. As an example, if we wrote a key into stronghold, then we expect it to be there next time if that's how our component is designed. If it's not there, that's a fatal error.
For lack of a better definition, a local error is one that is not a fatal error. A deviation from the happy path where no invariant has been broken. One a user can handle in the general case, without shutting down the entire application. (This definition is unsatisfactory for making a proper distinction between both; can someone come up with a better definition?). An example is a NotFound error, i.e. a user provided the wrong input. That's not fatal. An application wants to handle that error typically by letting the user know what they wanted to lookup was not found, but they can retry. No invariant has been broken, things are typically recoverable. Of course, an application building on top may have been designed with the invariant that something should've been there, but wasn't. In that case, that's a fatal error for them, but not for our component. (See also @elenaf9's first point.)

The idea would be to add a FatalError to our library (intentionally vague, see later sections) that is a catch all for fatal errors. Since users won't match on them, we can summarize them into this type. Concrete examples from the account enum are

StrongholdMutexPoisoned, SharedReadPoisoned, SharedWritePoisoned: These only occur when someone else panicked while holding the lock. We shouldn't panic (unwrap), but it's okay to return a fatal error to signal that things have gone terribly wrong.
GenerationOverflow, GenerationUnderflow, IdentityIdOverflow: (this last one should have been removed in PR 436 anyway). We can specify that an invariant of the account is, that the number of identities stored is always <= X. Also, how often will these errors occur? Should they pollute our public error interface?
KeyPairNotFound: The account fully manages keys for the user. Thus, if a key pair is suddenly no longer to be found, this is a broken invariant and it's a fatal error.
InvalidResourceIndex: "Caused by attempting to parse an invalid Stronghold resource index." The code that returns this error looks a lot like a check for an invariant. Thus, a broken invariant should be a FatalError.

More debatable variants are:

IdentityInUse: This is returned if a user loads two accounts that manage the same identity. We prohibit that to prevent concurrent updates that potentially overwrite each other. It's debatable whether an application could encounter this without it being fatal. We probably cannot make that assumption in the general case (did I mention error handling was hard?), and by the definition of local errors, loading an identity for the second time is not breaking an invariant. It's more akin to how a lookup can fail, so I would count this as a local error instead.

Some variants we should keep as local errors:

Most *NotFound errors, e.g. IdentityNotFound, MethodNotFound, ServiceNotFound, since they are common and not fatal. Generally though, I think lookup methods should return Option<T> and a higher-level function that calls these should map the None to an Err.
InvalidPrivateKey (disregarding the duplication with the crypto crate variant): It's erroneous user input. No invariant is broken, and it's not considered fatal.
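The Option-then-map idea for lookups might look like the sketch below. The Account and Identity layouts, the method names and the two-variant AccountError are simplified assumptions and do not mirror the actual identity-account types.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum AccountError {
    IdentityNotFound,
    MethodNotFound,
}

pub struct Identity {
    methods: Vec<String>,
}

pub struct Account {
    identities: HashMap<String, Identity>,
}

impl Account {
    /// Low-level lookup: absence is not an error here, so return `Option`.
    fn find_identity(&self, did: &str) -> Option<&Identity> {
        self.identities.get(did)
    }

    /// Higher-level operation: here absence is a failure, so map `None` to `Err`.
    pub fn first_method(&self, did: &str) -> Result<&str, AccountError> {
        let identity = self
            .find_identity(did)
            .ok_or(AccountError::IdentityNotFound)?;
        identity
            .methods
            .first()
            .map(String::as_str)
            .ok_or(AccountError::MethodNotFound)
    }
}
```

The lookup itself stays free of error types, and only the function that knows why the value was needed decides which local error a missing value corresponds to.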

Internal vs. External

Separate more clearly between internal and external errors. This boils down to more distinct error types, less wrapping, and more mapping. Basically what @elenaf9 said, applied to our situation.

Take the MemStore for example. Because of it, the account error has a KeyVaultNotFound variant. This error would be more appropriate to be a variant of MemStoreError (which doesn't currently exist). Since the account manages the storage for us, if a key vault is suddenly no longer found, is that also a broken invariant? If it can occur when a user looks up an identity that doesn't exist, then this error should be mapped appropriately to IdentityNotFound or a None, depending on the return type. The point is that the KeyVaultNotFound variant is not part of our public interface.
A similar reasoning applies to EventNotFound. Events are not a concept a user is familiar with since the account handles it for them. Thus having this in the error variant only confuses a developer, and it is likely to be handled in a wild card. It should either be mapped to a proper external error that a user can understand, or it may indicate a broken invariant, in which case Fatal is the way to go.
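As an illustration of "less wrapping, more mapping", here is a sketch in which a private MemStoreError never reaches the public interface. Which public error each internal variant maps to is an assumption chosen for the example, as are the function names and the Fatal variant holding a String.

```rust
/// Internal error of the in-memory store; not part of the public interface.
#[derive(Debug)]
enum MemStoreError {
    KeyVaultNotFound,
    EventNotFound,
}

/// Public error: only concepts that a user of the account knows about.
#[derive(Debug, PartialEq)]
pub enum AccountError {
    IdentityNotFound,
    Fatal(String),
}

/// Map (rather than wrap) the internal error into the public vocabulary.
fn to_public(err: MemStoreError) -> AccountError {
    match err {
        // Assumed here to happen when the user asks for an unknown identity.
        MemStoreError::KeyVaultNotFound => AccountError::IdentityNotFound,
        // Assumed here to indicate a broken invariant of the account.
        MemStoreError::EventNotFound => {
            AccountError::Fatal("event log is missing an entry".to_owned())
        }
    }
}

/// Stand-in for an internal storage read.
fn read_key(vault_exists: bool, events_intact: bool) -> Result<&'static str, MemStoreError> {
    if !vault_exists {
        return Err(MemStoreError::KeyVaultNotFound);
    }
    if !events_intact {
        return Err(MemStoreError::EventNotFound);
    }
    Ok("public-key-bytes")
}

pub fn public_key(vault_exists: bool, events_intact: bool) -> Result<&'static str, AccountError> {
    read_key(vault_exists, events_intact).map_err(to_public)
}
```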

Convenience vs. explicitness

I want to preface this: None of these options address the fact that functions have an error in their signature that has all the variants even though the function may actually return just a subset of it. I don't know how to handle the combinatorial explosion of many possible unique subsets that functions can return, but would like to be proven wrong. I know stronghold did just that in a way, but I'm unsure if it applies to the account. Perhaps someone wants to explore that option? I primarily see a convenience issue, i.e. having to add lots of From implementations for all possible error types. I think it's more important to distinguish between local and fatal errors than be explicit about what exact errors are possible.

There are at least two options of putting this distinction into code. One is to take the current account enum, reduce the variants according to the above, and add a Fatal variant. Users are expected to match on all but the fatal variant to handle local errors, and handle the fatal variant as they deem appropriate, probably propagating it upwards.

Taking the sled error post as inspiration, the second approach is to generally change our function signatures to:

fn method(...) -> Result<Result<T, AccountError>, FatalError>

where FatalError is some struct that holds either a string or a Box<dyn Error>. AccountError is the current account error enum reduced to local errors only. As an optimization to the readability of this signature, we could define an alias

pub type AccountResult<T> = Result<Result<T, AccountError>, FatalError>;

As pointed out in the sled post, this allows users to use ? for fatal errors, bubbling them up, while having to make another conscious decision about what to do with local errors. If they don't care, they can do a double ?? to bubble up both (only requiring two From impls). Since the enum has been reduced to local errors, it's much easier for users to determine if these errors can occur in their particular application, and if not use unwrap or match to unreachable!(). The problem of having variants that cannot occur in there remains, and may have to be addressed in documentation, which is definitely the biggest argument against it. Still, every Rust error handling approach has this problem today, and this whole thing is mainly an attempt to improve the situation, not fix it completely.
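A small sketch of how a caller would interact with the nested result. The single-variant AccountError, the string-backed FatalError, the AppError type and the load_* functions are all placeholders invented for this example.

```rust
#[derive(Debug, PartialEq)]
pub enum AccountError {
    IdentityNotFound,
}

#[derive(Debug, PartialEq)]
pub struct FatalError(String);

pub type AccountResult<T> = Result<Result<T, AccountError>, FatalError>;

/// Stand-in for an account method: the outcome depends only on `id`.
fn load_identity(id: u32) -> AccountResult<String> {
    match id {
        0 => Err(FatalError("storage lock poisoned".to_owned())),
        1 => Ok(Ok("did:example:1".to_owned())),
        _ => Ok(Err(AccountError::IdentityNotFound)),
    }
}

/// A single `?` bubbles up the fatal error only; the local error must be
/// handled (or explicitly propagated) right here.
fn load_or_default(id: u32) -> Result<String, FatalError> {
    match load_identity(id)? {
        Ok(did) => Ok(did),
        Err(AccountError::IdentityNotFound) => Ok("did:example:default".to_owned()),
    }
}

/// Application-level error for callers that want to bubble up everything.
#[derive(Debug, PartialEq)]
pub enum AppError {
    Local(AccountError),
    Fatal(FatalError),
}

impl From<AccountError> for AppError {
    fn from(err: AccountError) -> Self {
        AppError::Local(err)
    }
}

impl From<FatalError> for AppError {
    fn from(err: FatalError) -> Self {
        AppError::Fatal(err)
    }
}

/// The double `??`: the first converts the fatal error, the second the local one.
fn load_or_bail(id: u32) -> Result<String, AppError> {
    Ok(load_identity(id)??)
}
```

Ignoring local errors is still possible, but it now takes a visible second ? and two From impls instead of happening by accident.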

In summary, this proposes to:

Find consensus on a definition for fatal and local errors and separate as best we can into these two types.
Separate more clearly between internal and external errors, by having two error types.
Be more explicit in the return type, separating local and fatal errors

Would love to hear everyone's thoughts!

1 reply

olivereanderson Nov 3, 2021
Author

Really great ideas and it is always nice to see concrete examples from our code base and not some imaginary Foo :)

I like the idea of using:
fn method(...) -> Result<Result<T, AccountError>, FatalError> as it more clearly states which errors need to be handled at the call site, vs those that can be bubbled up, and if someone potentially wants to avoid all error handling then having to use the double ?? will hopefully make them think again as that feels a bit more like using unwrap.

olivereanderson · 2021-10-29T19:37:45Z

olivereanderson
Oct 29, 2021
Author

:::info
What is written in this comment is not a reaction to any other comments, but extends my initial comment/write-up with more arguments that I did not think of when this discussion was created.
:::

A note on stability

Another argument against enum style errors at the crate level is versioning: unless the enum is marked with non_exhaustive, it is a breaking change to add a new variant to it. Thus, if someone adds a new fallible function to one of our (sub)-crates and this function can fail in some way that is not listed in the crate's error enum, then adding the missing variant to that enum is a breaking change. If we instead add the non_exhaustive attribute to the error enum(s), then we are denying callers the ability to exhaustively match on all variants. One could argue that callers probably wouldn't want to match against all cases of a huge enum with relatively unrelated variants anyway, but that probably suggests that it wasn't the correct representation to begin with.

Note that what is described above also applies to enum style errors on the module level, hence if we decide to have a module with a single enum style error, then we need to consider how likely it is for this module to have implemented everything that logically should or could belong there.

Hence if we want to provide exhaustive matching on the errors we expose in our public APIs and also easily add new functionality without introducing a major version, then the easiest way to achieve this goal is to avoid error types on the crate and (if possible) module level.
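For reference, this is roughly what the non_exhaustive trade-off looks like in code. CrateError and describe are made-up names. Note that the attribute only affects other crates; inside the defining crate, as in this single-file sketch, the wildcard arm is not required, hence the allow attribute.

```rust
/// With `#[non_exhaustive]`, adding a variant later is not a breaking change.
#[non_exhaustive]
#[derive(Debug, PartialEq)]
pub enum CrateError {
    InvalidKey,
    NotFound,
}

/// How a downstream crate has to write its match.
#[allow(unreachable_patterns)]
pub fn describe(err: &CrateError) -> &'static str {
    match err {
        CrateError::InvalidKey => "invalid key",
        CrateError::NotFound => "not found",
        // Mandatory in downstream crates: they can never prove that they have
        // covered every variant, which is the cost described above.
        _ => "unknown error",
    }
}
```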

Do we have kitchen-sink enums on our hands?

This post by Matklad speaks of the kitchen-sink enum anti-pattern. From what I understand it means to add all possible errors, including external ones, into a single enum. Even though our case is not as dire as the example from that post, the three problems listed there still (at least partially) apply to us:

exposing errors from underlying libraries makes them a part of your public API. Major semver bump in your dependency would require you to make a new major version as well.

Second, it sets all the implementation details in stone. For example, if you notice that the size of ConnectionDiscovery is huge, boxing this variant would be a breaking change.

Third, it is usually indicative of a larger design issue. Kitchen sink errors pack dissimilar failure modes into one type. But, if failure modes vary widely, it probably isn’t reasonable to handle them! This is an indication that the situation looks more like the case two (meaning the errors internal structure can be encapsulated).

Note that we mostly don't re-export all the errors from our dependencies, so the first point only partially applies, but we should still keep this in mind.

A more responsible way of pushing error handling to the caller

It is possible to push error handling to the caller/user in a more isolated fashion. The idea is well explained in the blog post by Matklad from the previous section so I will quote it directly here:

An often-working cure for error kitchensinkosis is the pattern of pushing errors to the caller.

Consider this example

    fn my_function() -> Result<i32, MyError> {
        let thing = dep_function()?;
        ...
        Ok(92)
    }

my_function calls dep_function, so MyError should be convertible from DepError. A better way to write the same might be this:

    fn my_function(thing: DepThing) -> Result<i32, MyError> {
        ...
        Ok(92)
    }

In this version, the caller is forced to invoke dep_function and handle its error. This exchanges more typing for more type-safety. MyError and DepError are now different types, and the caller can handle them separately. If DepError were a variant of MyError a runtime match-ing would be required.

Another benefit of this approach is that it tends to be easier to test, as it leads to fewer side effects for any given function. The downside is that it can also lead to making more functions part of the public API even if the user is not interested in them, and the user is also forced to understand more dependencies. One could however argue that in some situations one would have to understand dep_function (in the example above) in order to handle the error of my_function (also in the example above) properly anyway. Moreover, we as library authors/contributors could also help the user learn how to compose more functionality themselves by providing good code examples and guides.

If this alternative way of pushing error handling to the caller could be of interest to us, then we also need to consider how well it fits with the DID specifications, as they may specify exactly what a function's signature should be (see for instance resolution).

0 replies

olivereanderson · 2021-11-04T09:38:15Z

olivereanderson
Nov 4, 2021
Author

Status update: In our weekly meeting we agreed to change our error handling in this library and we will start with the identity-core crate.

I suggest that we keep this discussion open for now so that we may ask for suggestions and/or help while we are refactoring the errors.

0 replies

JelleMillenaar · 2021-11-04T10:41:28Z

JelleMillenaar
Nov 4, 2021
Collaborator

I don't have much to add, but I really liked reading the full discussion. From a framework perspective, our design philosophy centers around being convenient to implement. This does mean that error handling is an important thing to do right. As is mentioned in the discussions, I agree that we shouldn't force users to figure out which errors matter to them. We should do the heavy lifting and create manageable subsets for them. I like the idea of splitting the Fatal errors out and just keep them a bit vague, but make it clear these are fatal, so nothing you can really do about it. Lastly, I also think that the above point on breaking changes should be taken into account.

In short, my opinion is that we should take the burden of responsibility away from the library consumers, at the cost of more work.

0 replies
