Question: Reproducing clusterd moka-housekeeper segfault #20136

tatsuya6502 · 2023-06-25T12:09:32Z

tatsuya6502
Jun 25, 2023

Hi. I am the moka creator and trying to reproduce #19746. If I could reproduce, I can start a root cause analysis.

From #19746:

What version of Materialize are you using?

9ab5f83

What is the issue?

This happened twice in a row in long Zippy w/ user tables test: https://buildkite.com/materialize/release-qualification/builds/197#018883bf-5086-4a43-a8b4-ede1ef192de5

I am running "Long Zippy w/ user tables test" locally at commit 9ab5f83, but I have not reproduced the segfault yet. moka-rs/moka#281

So far, I ran it with the following parameters:

80,000 actions (took 11 hours 41 minutes)
50,000 actions (took 7 hours 14 minutes)

I think I need to continue to run it so I can eventually reproduce the segfault. But I am not sure if I am doing right because I am new to Materialize.

Question 1: How often did the segfault happen?

How many times did you run release qualification jobs with moka enabled, and how many times did the segfault happen? (rough numbers are fine)
(I am asking this because I cannot see the Buildkite logs. It is asking me to log in but I am not sure if I can create one)

Question 2: Am I doing right on the test?

I am running "Long Zippy w/ user tables test" as the followings. Am I doing right?

$ git clone [email protected]:MaterializeInc/materialize.git
$ cd materialize
$ git checkout 9ab5f833b

$ ./bin/mzcompose --find zippy down -v
...

$ ./bin/mzcompose --find zippy run default --scenario UserTablesLarge --actions 80000
==> Collecting mzbuild images
materialize/ubuntu-base:mzbuild-WBP7JFLWFQYUIHLXR5JSOAJ74VVKI6O7
materialize/clusterd:mzbuild-XEZDVCORJVQEPNOVJEOF33HFME4GV6QA
materialize/materialized:mzbuild-TV5MN5FZHFZO2FGJWS43E7Y2ASI423UW
materialize/test-certs:mzbuild-6JWZ2MGLSYOCL3MLHIEIJMMLQLEOLCNY
materialize/postgres:mzbuild-ENTVWBFMO762DEEKAMURDMVMQ52GIO63
materialize/testdrive:mzbuild-NRRGSD7T3OT3MVNN4UFU7H3AXQWWPW4N
==> Running test case workflow-default
==> Running workflow default
...
Generating test...
Running test...
--- #1: KafkaStart
...
--- #2: CockroachStart
...
--- #3: MinioStart
...
--- #4: MzStart
...
--- #5: StoragedStart
...
--- #6: CreateTable
> CREATE TABLE table_0 (f1 INTEGER);
rows match; continuing at ts 1687609226.7478156
> INSERT INTO table_0 VALUES (0);
rows match; continuing at ts 1687609226.7642455
--- #7: CreateTable
> CREATE TABLE table_1 (f1 INTEGER);
rows match; continuing at ts 1687609227.1155088
> INSERT INTO table_1 VALUES (0);
rows match; continuing at ts 1687609227.1261883
--- #8: ShiftBackward table_0
> UPDATE table_0 SET f1 = f1 - 1538;
rows match; continuing at ts 1687609227.4865289
--- #9: ShiftForward table_1
> UPDATE table_1 SET f1 = f1 + 7091;
rows match; continuing at ts 1687609227.7569017
...

--- #79998: DeleteFromHead table_1
> DELETE FROM table_1 WHERE f1 > 16601224;
rows match; continuing at ts 1687651205.2661505
--- #79999: ShiftBackward table_0
> UPDATE table_0 SET f1 = f1 - 443;
rows match; continuing at ts 1687651205.803389
--- #80000: ShiftBackward table_0
> UPDATE table_0 SET f1 = f1 - 1272;
rows match; continuing at ts 1687651206.5515945
==> mzcompose: test case workflow-default succeeded

My environment

WSL2 (Windows Subsystem for Linux 2) on Windows 11 (x86_64)
- Allocated 8 logical CPU cores and 24GiB of memory to WSL2.
Docker Desktop for Windows
- Running containers in WSL2.
CPU: Intel Core i7-12700F
- 8 × performance cores × 2 hyper threads
- 4 × efficient cores
- Total 20 hyper threads
32GiB of memory

I allocated only 8 logical CPU cores to WSL2 because the following reasons:

It seems you are using an c5.2xlarge instance (which have 8 vCPUs) for release qualification.
I want to reduce the memory utilization:
- I found that allocating more logical cores increases the memory utilization.
- e.g. If I allocate all 20 logical cores, the memory utilization goes beyond 24GiB and swap space is used.

Other things I tried

Reviewed errors and warnings that Miri reported on the concurrent hash table (cht) module in moka:
- I concluded they are false positives moka-rs/moka/issues/279 comment
Modified our benchmarking tool to use Materialize's SegmentedBytes instead of Arc<[u8]> as a part of the value type. moka-rs/moka/issues/279 comment
- Ran it several times but could not reproduce the issue.

Answered by def-

Jun 26, 2023

I'm guessing that using a larger machine makes it easier to reproduce. I have now started a run on a c6a.12xlarge EC2 instance. Based on previous experience with this issue I hope to have a segfault in 3-6 hours. Will check back then.

View full answer

benesch · 2023-06-25T13:35:03Z

benesch
Jun 25, 2023
Maintainer

Hi @tatsuya6502, thanks very much for trying to repro this! As I think someone pointed out somewhere we were using moka v0.9 when we observed the crashes, so it's possible that the segfaults were fixed in a more recent version of moka. That said, the commit of Materialize that you're testing definitely includes moka v0.9:

materialize/Cargo.lock

Line 3233 in 9ab5f83

version = "0.9.6"

so I'm surprised you're not able to readily repro.

I wonder if it's something to do with the fact that you're using WSL. That introduces a layer of emulation, right? I wonder if that slows things down enough that you no longer see the segfaults.

CC'ing @MaterializeInc/qa. We can try to repro this on a scratch EC2 instance that looks very similar to our CI hardware. If it repros, we can give you SSH access to the machine for further debugging. How does that sound as a plan?

1 reply

tatsuya6502 Jun 25, 2023
Author

Thank you for the quick reply.

We can try to repro this on a scratch EC2 instance that looks very similar to our CI hardware. If it repros, we can give you SSH access to the machine for further debugging. How does that sound as a plan?

Thanks. I think no SSH access is needed; I do not think I can figure out the root cause by just accessing the instance after reproducing. To fix this kind of issue, I will have to do many experiments; modifying moka and Metirialize codes and run the same test to see whether the issue remains or not. This can take weeks (depending on how hard to reproduce the issue).

So it will be very helpful if you try to reproduce the issue using a scratch EC2 instance, and if it repros, then tell me how to build such an EC2 instance and how to run the test (without setting up Buildkite). This way, I can do the experiments on my own EC2 instance.

benesch · 2023-06-25T14:36:36Z

benesch
Jun 25, 2023
Maintainer

Sounds good! Hopefully someone from the QA team has time to attempt the repro early next week and give you the full set of instructions.

…

On Sun, Jun 25, 2023 at 10:26 AM Tatsuya Kawano ***@***.***> wrote: Thank you for the quick reply. We can try to repro this on a scratch EC2 instance that looks very similar to our CI hardware. If it repros, we can give you SSH access to the machine for further debugging. How does that sound as a plan? Thanks. I think no SSH access is needed; I do not think I can figure out the root cause by just accessing the instance after reproducing. To fix this kind of issue, I will have to do many experiments; modifying moka and Metirialize codes and run the same test to see whether the issue remains or not. This can take weeks (depending on how hard to reproduce the issue). So it will be very helpful if you try to reproduce the issue using a scratch EC2 instance, and if it repros, then tell me how to build such an EC2 instance and how to run the test (without setting up Buildkite). This way, I can do the experiments on my own EC2 instance. — Reply to this email directly, view it on GitHub <#20136 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAGXSIG346KPG33IRUNLMK3XNBDBLANCNFSM6AAAAAAZTDSJYA> . You are receiving this because you commented.Message ID: ***@***.*** .com>

0 replies

def- · 2023-06-26T09:07:48Z

def-
Jun 26, 2023
Maintainer

I'm guessing that using a larger machine makes it easier to reproduce. I have now started a run on a c6a.12xlarge EC2 instance. Based on previous experience with this issue I hope to have a segfault in 3-6 hours. Will check back then.

7 replies

tatsuya6502 Jun 26, 2023
Author

@def- OK. I am done with the machine. Thank you for your help!

I downloaded docker logs. I also searched for a core dump, but as expected, I could not find it (Taking core dumps will be disabled by default). Now I know how to reproduce using a c6a.12xlarge instance, so I will start the same instance and try to reproduce the issue with core dumps enabled.

Also I ran moka's benchmark program on the machine to see if I can get segmentation fault. But I could not. I will analyze the Docker logs and Materialize source code to see if I can add something to the benchmark program to reproduce the issue. (The benchmark program is already using Materialize's SegmentedBytes taken from src/ore/src as a part of the value type)

I will update you guys when I made some progress.

## Running maka's benchmark program.

$ git clone [email protected]:moka-rs/moka.git
$ cd moka
$ git checkout v0.9.6

## This change will increase the chances of race conditions.
$ vi src/sync_base/base_cache.rs
$ git diff
diff --git a/src/sync_base/base_cache.rs b/src/sync_base/base_cache.rs
index 7cfb451..d6c101c 100644
--- a/src/sync_base/base_cache.rs
+++ b/src/sync_base/base_cache.rs
@@ -904,7 +904,7 @@ where
         let initial_capacity = initial_capacity
             .map(|cap| cap + WRITE_LOG_SIZE)
             .unwrap_or_default();
-        const NUM_SEGMENTS: usize = 64;
+        const NUM_SEGMENTS: usize = 1;
         let cache = crate::cht::SegmentedHashMap::with_num_segments_capacity_and_hasher(
             NUM_SEGMENTS,
             initial_capacity,

$ cd
$ git clone [email protected]:moka-labs/moka-gh279-materialize-memory-corruption/mokabench.git
$ cd mokabench/

## Use local moka checkout.
$ vi Cargo.toml
$ git diff Cargo.toml
diff --git a/Cargo.toml b/Cargo.toml
index 26fbbc2..991fdba 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -61,7 +61,7 @@ features = ["future"]
 [dependencies.moka09]
 package = "moka"
 optional = true
-version = "=0.9.6"
+path = "../moka"
 # features = ["future", "dash"]
 default-features = false
 features = ["sync", "dash"]

$ cargo tree -i moka
moka v0.9.6 (/home/ubuntu/moka)
└── mokabench v0.9.0 (/home/ubuntu/mokabench)

## Run the benchmark.
$ cd datasets/
$ curl -L -O https://www.dropbox.com/sh/9ii9sc7spcgzrth/YGfSt_Tjmm/Papers/ARCTraces/S3.lis
$ cd ..
$ ulimit -c unlimited

$ cargo run --release -- --repeat 5 --invalidate --size-aware --eviction-listener immediate
...

Config { trace_file: S3, ttl: None, tti: None, num_clients: None, repeat: Some(5), insertion_delay: None, insert_once: false, invalidate: true, invalidate_all: false, invalidate_entries_if: false, iterate: false, eviction_listener: Immediate, size_aware: true, entry_api: false, per_key_expiration: false }

Cache, Max Capacity, Clients, Inserts, Reads, Hit Ratio, Invalidates, Evicted by Size, Expired, Duration Secs
Moka Async Cache, 3276800000, 16, 61579775, 65257906, 17.065, 846352, 59382083, 0, 273.759
Moka Async Cache, 3276800000, 24, 61514196, 65257906, 17.165, 842227, 59315515, 0, 279.364
Moka Async Cache, 3276800000, 32, 61560338, 65257906, 17.095, 845893, 59361859, 0, 286.001
Moka Async Cache, 3276800000, 40, 61604191, 65257906, 17.027, 848807, 59407571, 0, 301.307
Moka Async Cache, 3276800000, 48, 61654887, 65257906, 16.950, 848067, 59465545, 0, 307.807
Moka Async Cache, 13107200000, 16, 41288942, 65257906, 48.158, 2708729, 34657974, 0, 357.201
Moka Async Cache, 13107200000, 24, 41312540, 65257906, 48.122, 2683806, 34713494, 0, 360.830
^C

def- Jun 26, 2023
Maintainer

I'll try again with a coredump and provide it, then shut down the server.

tatsuya6502 Jun 26, 2023
Author

I'll try again with a coredump and provide it

That will help. Thank you again!

I wonder moka-housekeeper thread crashes when dropping a value SegmentedBytes, or when dropping the moka::Cache itself. The coredump may (or may not) tell the answer.

(It is midnight here UTC+0800. I will check message/email tomorrow morning)

def- Jun 27, 2023
Maintainer

Unfortunately I also didn't manage to enable coredumps for my second run, and then the third didn't manage to reproduce the issue. I have tried on my local machine since then and been unable to reproduce there. I'll try again, but seems highly timing dependent.

tatsuya6502 Jun 28, 2023
Author

Thank you for the update! All my initial questions were answered already so I should be okay.

Actually, the crash just happened on my local PC (WSL2 on Windows 11, 20 hyperthreads). I did not enable coredump (ahh!) but I will do next time. (I changed some internal parameters of moka that may increase the chance of crash. I was also adding tracing logs to moka and checking if these messages appear in the log. I did not expect the crash will soon happen on my PC)

I will continue using my PC, and when needed, I will create and use an EC2 instance too. So, you can stop now if you want.

## WSL2 on Windows 11
$ uname -a
Linux *** 5.15.90.1-microsoft-standard-WSL2 #1 SMP Fri Jan 27 02:56:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

$ ./bin/mzcompose --find zippy run default --scenario UserTablesLarge --actions 100000
...

--- #45221: ValidateView
--- #45222: DeleteFromTail table_0
> DELETE FROM table_0 WHERE f1 < 8407004;
rows didn't match; sleeping to see if dataflow catches up 50ms 75ms 113ms 169ms 253ms 380ms 570ms 854ms 1s 2s 3s 4s 6s 10s 15s 22s 33s 49s 74s 111s 166s 97s
^^^ +++
1:1: error: preparing query failed: connection closed
     |
   1 | > DELETE FROM table_0 WHERE f1 < 8407004;
     | ^
+++ !!! Error Report
1 errors were encountered during execution

$ sudo dmesg
...
[29921.897189] moka-housekeepe[4871]: segfault at 99e ip 00005614c502abc9 sp 00007fa860378f70 error 4 in clusterd[5614c21ed000+42b7000]
[29921.898701] Code: 0f 7f 4c 24 10 66 0f 70 c8 ee 66 0f ef 0d 7f 0f 24 fc 66 0f 7f 4c 24 20 66 0f 7f 44 24 30 48 c7 44 24 50 00 00 00 00 48 8b 06 <48> 8b 70 18 48 8b 50 20 48 8d 5c 24 10 48 89 df e8 42 db 0a 00 c6
[29921.900400] potentially unexpected fatal signal 11.
[29921.900857] CPU: 15 PID: 4871 Comm: moka-housekeepe Not tainted 5.15.90.1-microsoft-standard-WSL2 #1
...
[29925.596029] potentially unexpected fatal signal 6.
[29925.597071] CPU: 0 PID: 3284 Comm: tokio:work-5 Not tainted 5.15.90.1-microsoft-standard-WSL2 #1
...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Question: Reproducing clusterd moka-housekeeper segfault #20136

{{title}}

{{editor}}'s edit

{{editor}}'s edit

What version of Materialize are you using?

What is the issue?

Replies: 3 comments 8 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

Question: Reproducing clusterd moka-housekeeper segfault #20136

tatsuya6502 Jun 25, 2023

What version of Materialize are you using?

What is the issue?

Question 1: How often did the segfault happen?

Question 2: Am I doing right on the test?

My environment

Other things I tried

Replies: 3 comments · 8 replies

benesch Jun 25, 2023 Maintainer

tatsuya6502 Jun 25, 2023 Author

benesch Jun 25, 2023 Maintainer

def- Jun 26, 2023 Maintainer

tatsuya6502 Jun 26, 2023 Author

def- Jun 26, 2023 Maintainer

tatsuya6502 Jun 26, 2023 Author

def- Jun 27, 2023 Maintainer

tatsuya6502 Jun 28, 2023 Author

tatsuya6502
Jun 25, 2023

Replies: 3 comments 8 replies

benesch
Jun 25, 2023
Maintainer

tatsuya6502 Jun 25, 2023
Author

benesch
Jun 25, 2023
Maintainer

def-
Jun 26, 2023
Maintainer

tatsuya6502 Jun 26, 2023
Author

def- Jun 26, 2023
Maintainer

tatsuya6502 Jun 26, 2023
Author

def- Jun 27, 2023
Maintainer

tatsuya6502 Jun 28, 2023
Author