Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shutdown strengthening #16

Merged
merged 15 commits into from
Jan 2, 2024
Merged

Shutdown strengthening #16

merged 15 commits into from
Jan 2, 2024

Conversation

diondokter
Copy link
Collaborator

Using the fuzzer to simulate a power shutdown and adding a repair function that can solve it.

@diondokter
Copy link
Collaborator Author

Not sure if anybody's watching this space...
But if you are, what do you think of the repair function?
Should it be something that should be done automatically?

Ideally yes, but that'd be very annoying to implement...
Leaving it to the user is way easier.

@lulf
Copy link
Contributor

lulf commented Jan 2, 2024

@diondokter I like the repair function, very nice improvement. I think leaving it up to the user is better and feels more in line with the rest of the API to be honest.

Perhaps a rustdoc comment on the Corrupted enum variant pointing to the try_repair would be sufficient?

@diondokter
Copy link
Collaborator Author

@diondokter I like the repair function, very nice improvement. I think leaving it up to the user is better and feels more in line with the rest of the API to be honest.

Perhaps a rustdoc comment on the Corrupted enum variant pointing to the try_repair would be sufficient?

Ok, cool. I'll do that then.

But I got snagged by multiple writes on flash.
Map requires only single-write flash, but now its repair function requires flash that can write twice...
Queue requires twice-write flash, but now its repair function requires flash that can write three times...

So... I'm gonna have to spend some time to fix that, if that's even possible.

@diondokter diondokter mentioned this pull request Jan 2, 2024
…h an adapted crc to fix the multi-write issues
@diondokter diondokter marked this pull request as ready for review January 2, 2024 14:29
@diondokter
Copy link
Collaborator Author

Cool! I was able to fix it without doing major changes :D

…shutdown versions were a superset, so the old are not necessary anymore.
Find item now skips forward a full item header instead of just a word when it encounters corruption. This makes things more consistent.
@diondokter
Copy link
Collaborator Author

Turns out it's actually useful to fuzz in CI! Otherwise I'd already have merged it :P

@diondokter diondokter merged commit 51ad715 into master Jan 2, 2024
4 checks passed
@lulf
Copy link
Contributor

lulf commented Jan 5, 2024

@diondokter I've been testing the repair function a bit now, and I wanted to ask about a behavior I'm seeing in my initialization process, sssuming I've written gibberish to the queue partitions before startup:

  • Peek fails with corrupted error as expected.
  • Running try_repair succeeds
  • Running peek again fails with the corrupted error

The error in particular is "Corrupted: All pages are closed", which suggests it haven't realy repaired/erased anything.

@lulf
Copy link
Contributor

lulf commented Jan 5, 2024

In particular it's caused by the find_youngest_page at the end of the function:

    // All pages are closed... This is not correct.
    #[cfg(feature = "defmt")]
    defmt::error!("Corrupted: All pages are closed");

    Err(Error::Corrupted {
        #[cfg(feature = "_test")]
        backtrace: std::backtrace::Backtrace::capture(),
    })

I'm wondering if there is a condition here that the repair doesn't fully fix. Perhaps it should check the for this condition as well and return an error in that case?

This would make the use of the API a lot smoother, since you can trust try_repair to give you an error if something is wrong rather than trying to use the flash.

@diondokter
Copy link
Collaborator Author

diondokter commented Jan 5, 2024

Ah I see.

I designed the repair function so it can repair the state if it got reasonably corrupted from a previous correct state.
If you start with gibberish, then there's not really anything that can be repaired. It likely also won't be able to repair effects from random bitrot.

So right now the repair function is only repairing things that result from an unexpected shutoff with the assumption that bytes are written from start to end every transaction.

In your case, if you follow the documentation, you should end up with erasing the flash yourself so you start fresh again.

I guess that checking the flash for errors in the repair function would be nice for users, but there are so many things that could go wrong it'd be difficult to check them all.

@lulf
Copy link
Contributor

lulf commented Jan 5, 2024

Alright, thanks for explaining, that makes sense!

@diondokter diondokter deleted the shutdown-strengthening branch January 5, 2024 18:41
@diondokter diondokter mentioned this pull request Jan 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants