New nemeses in Jepsen v0.3.0 #66

Open
cole-miller opened this issue Dec 10, 2022 · 3 comments
Labels
enhancement New feature or request

Comments

@cole-miller
Contributor

Jepsen v0.3.0 introduces a "packet" nemesis that messes with the network and a "file corruption" nemesis that flips bits in disk files. We should look into enabling these (I tried doing it in #65, but ran into a lot of weird errors).

@MathieuBordere
Contributor

MathieuBordere commented Dec 12, 2022

The file corruption nemesis might be interesting. We would need to investigate whether a packet nemesis makes sense for us, as we implicitly rely on a reliable transport like TCP. If the packet nemesis corrupts packets that arrive in libraft, then that will for sure break the system, as we don't do any integrity checks on the received data. Packet loss would probably just test the TCP implementation we're using. Packet delay could be interesting, though.
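
To make the "integrity checks on the received data" point concrete, here is a minimal sketch of what such a check could look like at the message level. It is not libraft's wire format: the framing (payload followed by a 4-byte big-endian CRC32) and the names `frame` and `unframe` are assumptions made purely for illustration.

```python
import struct
import zlib


def frame(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC32 of the payload (hypothetical framing)."""
    return payload + struct.pack(">I", zlib.crc32(payload))


def unframe(message: bytes) -> bytes:
    """Return the payload, or raise ValueError if the checksum does not match."""
    if len(message) < 4:
        raise ValueError("message too short to carry a checksum")
    payload, trailer = message[:-4], message[-4:]
    (expected,) = struct.unpack(">I", trailer)
    if zlib.crc32(payload) != expected:
        raise ValueError("checksum mismatch: payload was corrupted in transit")
    return payload
```

With a check like this, a corrupted message would be rejected at the receiver instead of being interpreted as valid data, so a corruption fault would look more like a dropped message.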

@nurturenature
Contributor

Hi,

What a wonderful Jepsen test!
The use of unshare/nsenter for Jepsen db nodes is 🤯.
And it's real, a continuously run GitHub Action!

> Jepsen v0.3.0 introduces a "packet" nemesis ... and a "file corruption" nemesis ...
> We should look into enabling these (I tried doing it in #65, but ran into a lot of weird errors).

I've been able to run the tests with the latest Jepsen in:

  • Jepsen reference environment, Debian/LXC/ssh
  • GitHub Action, Ubuntu/namespaces/nsenter

with just a few changes.

Re packet nemesis, as noted:

> Packet loss would probably just test the TCP implementation

Yes, e.g. loss is handled deep in the stack and has no impact.
In testing other databases, I have found that delay/rate/corruption/duplication/reorder affect message delivery and timing, and have helped expose corner cases.

A sample dqlite set with packet-nemesis:

# target the leader for packet faults

:nemesis	:info	:start-packet	[:primaries {:duplicate {:percent :20%, :correlation :75%}, :reorder {:percent :20%, :correlation :75%}}]
:nemesis	:info	:start-packet	[:shaped {"n1" #{"n2" "n5" "n4" "n3"}, "n2" #{"n1"}, "n3" #{"n1"}, "n4" #{"n1"}, "n5" #{"n1"}} :netem [:duplicate :20% :75% :reorder :20% :75% :delay :50ms :10ms :25% :distribution :normal]]
...

# see the impact in increased latency and in the number of add operations that fail:

:fail	:add	34	:locked
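
The `:netem` vector in the second log line reads like arguments to the Linux `tc` netem qdisc. As a rough sketch of that mapping, the options can be flattened into a single command line. This is not how Jepsen itself applies them (the log shows it also restricts shaping to the node pairs listed under `:shaped`), and the device name `eth0` and the helper names `netem_args` and `tc_command` are assumptions for illustration only.

```python
from typing import Dict, List, Sequence


def netem_args(behaviors: Dict[str, Sequence[str]]) -> List[str]:
    """Flatten {behavior: [values...]} into the netem part of a tc command."""
    args: List[str] = []
    for name, values in behaviors.items():
        args.append(name)
        args.extend(values)
    return args


def tc_command(dev: str, behaviors: Dict[str, Sequence[str]]) -> List[str]:
    """Build (but do not run) a tc command that shapes all egress on `dev`."""
    return ["tc", "qdisc", "add", "dev", dev, "root", "netem",
            *netem_args(behaviors)]


# Values copied from the :netem vector in the log line above.
behaviors = {
    "delay": ["50ms", "10ms", "25%", "distribution", "normal"],
    "duplicate": ["20%", "75%"],
    "reorder": ["20%", "75%"],
}

print(" ".join(tc_command("eth0", behaviors)))
```

netem documents that reordering needs a delay to be configured, which may be why the log shows a `:delay` entry even though only `:duplicate` and `:reorder` were requested.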

[Image: packet nemesis latency]

Re file corruption nemesis:

  • easy to overdo it
  • combines well with other faults to get into corner cases

I am hoping that you will be open to considering a sequence of PRs:

  • GitHub Action

    • support manually triggering a test run with arbitrary test/raft/dqlite repositories/branches
  • Latest Jepsen

    • update Jepsen version
    • dqlite.db/install
      • currently only works with prebuilt -binary and namespace/nsenter nodes
      • correctly build app
      • enable working with LXC/ssh nodes, e.g. Jepsen reference environment
  • Update existing tests

    • current Jepsen idioms/patterns
    • (slightly) more comprehensive generator, checker, partition types possible
    • final read after healing/quiesce
    • nicer plots, title, etc
  • Add packet and file corruption nemesis

I thought it best to provide some context before submitting any PRs.

I'll go ahead and submit a PR adding a manual trigger to the GitHub Action, and hope you're open to considering the rest.

Thanks!

@nurturenature
Contributor

Hoping the PRs aren't too much.

The tests are easy to work on.
The database and client functions and the general structure are good. 👍

#88 for list-append finishes the pass through the test workloads.

Will go through the nemeses next.

The action also seems to be more reliable.
In the last 3 days, 4 failures:

  • 3 like
    {:assert {:node "n1", :line "2023/03/12 14:10:05.200249 for jepsen: extra online spare"}}
  • 1 like
    ERROR: problem running ufw-init
    Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
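
The xtables lock failure looks like transient contention between firewall commands, and the error message itself points at the `-w` (wait) option. Where the failing command cannot be given that option directly, one possible workaround is to retry on that specific error. The sketch below assumes a hypothetical helper `run_with_lock_retry`; the retry count and delay are arbitrary, and nothing here comes from the test suite itself.

```python
import subprocess
import time
from typing import Callable, Sequence

# Substring taken from the error text quoted above.
LOCK_MESSAGE = "holding the xtables lock"


def run_with_lock_retry(cmd: Sequence[str],
                        attempts: int = 5,
                        delay_s: float = 1.0,
                        run: Callable = subprocess.run,
                        sleep: Callable[[float], None] = time.sleep):
    """Run cmd, retrying only when it fails because the xtables lock is held."""
    for attempt in range(1, attempts + 1):
        result = run(list(cmd), capture_output=True, text=True)
        if result.returncode == 0:
            return result
        contended = LOCK_MESSAGE in (result.stderr or "")
        if not contended or attempt == attempts:
            # Any other failure, or the final attempt, is reported to the caller.
            raise RuntimeError(
                f"{list(cmd)} failed with exit code {result.returncode}: "
                f"{result.stderr}")
        sleep(delay_s)
```

The `run` and `sleep` parameters exist only so the helper can be exercised without touching the real firewall. When the command is an `iptables` invocation under our control, passing `-w` is the simpler fix.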
    

@MathieuBordere MathieuBordere added the enhancement New feature or request label Jun 12, 2023

3 participants