
What caused so many failures in the 4k record size test on the London cluster #1334

Open
leftwo opened this issue Jun 3, 2024 · 3 comments

leftwo commented Jun 3, 2024

@faithanalog ran a test on the Madrid cluster, the details of which are recorded here:

https://github.com/oxidecomputer/artemiscellaneous/tree/main/cru-perf/2024-05-24-4krecordsize-omicron_c2f51e25-crucible_92fe269-propolis_6d7ed9a

There was some trouble getting all 64 VMs to start, so the logs may mix noise from those startup
errors with errors that happened under IO load.

The logs from propolis and crucible downstairs are here: /staff/artemis/cru-perf/2024-05-24-4krecordsize-omicron_c2f51e25-crucible_92fe269-propolis_6d7ed9a/logs.tar.gz

leftwo changed the title from "Analysis of 4k record size failures on Madrid cluster" to "Analysis of 4k record size failures on London cluster", and then to "What caused so many failures in the 4k record size test on the London cluster" on Jun 3, 2024.
leftwo commented Jun 4, 2024

A first-pass search of all the logs finds no panics.

Searching the upstairs logs for why clients disconnected:

```
cat logs/*/*propolis* | grep "client task is sending Done" | looker | grep "client task is sending" > client-done.txt
```

Then, in vim, trim the leading per-line prefix (timestamps, etc.) so identical reasons collapse, and run:

```
cat client-done.txt | sort -n | uniq -c
```
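The manual vim step can be skipped by extracting just the `Done(...)` token before counting — a sketch, assuming the reason string appears verbatim in the looker output:

```shell
# Tally client-task exit reasons directly from the propolis logs.
# grep -o keeps only the matching "Done(...)" portion of each line,
# so sort | uniq -c can group identical reasons without manual editing.
cat logs/*/*propolis* \
  | grep "client task is sending Done" \
  | looker \
  | grep -o 'Done(.*)' \
  | sort \
  | uniq -c \
  | sort -rn
```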

Gives us this summary:

```
1325 Done(ConnectionTimeout)
  95 Done(Timeout)
  67 Done(RequestedStop(TooManyOutstandingJobs))
  58 Done(RequestedStop(IOError))
  40 Done(Finished)
  35 Done(ConnectionFailed(Os { code: 146, kind: ConnectionRefused, message: "Connection refused" }))
  12 Done(RequestedStop(FailedLiveRepair))
   7 Done(WriteFailed(IO Error: Os { code: 12, kind: OutOfMemory, message: "Not enough space" }))
   1 Done(WriteFailed(IO Error: Os { code: 32, kind: BrokenPipe, message: "Broken pipe" }))
```

`1325 Done(ConnectionTimeout)`
The upstairs is trying to connect to a downstairs but cannot establish the connection: either it was never able to connect at all, or it is trying to reconnect after a failed connection while the downstairs is not answering.

`95 Done(Timeout)`
The downstairs failed to answer any message from the upstairs for 45 seconds.

`67 Done(RequestedStop(TooManyOutstandingJobs))`
The upstairs has built up so much IO that it must give up holding it for the downstairs, which is either not answering at all or not answering fast enough.


leftwo commented Jun 4, 2024

Logs are extracted at /staff/core/crucible-1334/4k-recordsize-debug


mkeeter commented Jul 25, 2024

After doing some research (for RFD 490), here's what I found: when doing writes, illumos ZFS splits each write into one transaction per record. This means that going from a 128 KiB to a 4 KiB recordsize, we end up with 32× more transactions, which (presumably) adds a bunch of overhead.
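The arithmetic above, plus the ZFS knob involved, can be sketched as follows (the dataset name `tank/crucible` is hypothetical, and `zfs set recordsize` only affects newly written blocks):

```shell
# 128 KiB -> 4 KiB recordsize: same amount of data written, but 32x more
# records, and illumos ZFS issues one write transaction per record.
echo "$((128 / 4))x more transactions per write"   # prints "32x more transactions per write"

# Inspecting/changing recordsize on a (hypothetical) dataset:
#   zfs get recordsize tank/crucible
#   zfs set recordsize=4k tank/crucible
```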

OpenZFS has a patch for this:
openzfs/zfs@b0cbc1aa9

We should consider backporting it and then testing small recordsizes again.
