
verify_dt_converge #30

Open
martinsumner opened this issue Dec 9, 2024 · 4 comments


martinsumner commented Dec 9, 2024

The test fails intermittently, although it appears to fail consistently when testing gset (and not other datatypes).

The test most often fails at check 3a, with only one of the expected values being returned:

```
Expected [<<"DANG">>,<<"Z">>,<<"Z^2">>], got {ok,{gset,[<<"Z">>],[],undefined}}
```

So the values `[<<"DANG">>,<<"Z^2">>]` do not appear to have been merged, i.e. the update in 2a on this side of the partition does not happen.

Note that the test doesn't assert that a check is ok until the final check. So the test may appear to fail on the final check, but it has already "failed" on a previous check; it is just that the `wait_until` timed out and returned an error which was ignored.
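To illustrate the pattern described above, a minimal sketch (function and variable names are assumed, not taken from the actual test source): each intermediate check polls under `rt:wait_until/1`, and a timeout simply falls through rather than failing the test there and then.

```erlang
%% Hypothetical sketch of the check pattern: rt:wait_until/1 polls the
%% fun until it returns true or a timeout elapses. If the check never
%% succeeds, wait_until returns an error tuple which the test ignores
%% for intermediate checks - only the final check asserts.
check_value(Fetch, Expected) ->
    rt:wait_until(fun() ->
        case Fetch() of
            {ok, Expected} ->
                true;
            Other ->
                lager:info("Expected ~p, got ~p", [Expected, Other]),
                false
        end
    end).
```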


martinsumner commented Dec 9, 2024

The issue with update_2a appears to be with the PB change to the GSET (there is a change to an object using the PB client, then a change to another object using the HTTP client). The PB change returns `{error, timeout}`, which is why check 2a fails.
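For reference, the failing PB update would have roughly this shape (bucket, key and client names are assumptions for illustration, and the set module may differ for the gset type in the client version used by the test):

```erlang
%% Hypothetical shape of the gset update via the Erlang PB client.
%% This call is the one observed to return {error, timeout}
%% intermittently after the partition is created.
Set0 = riakc_set:new(),
Set1 = riakc_set:add_element(<<"DANG">>, Set0),
Result = riakc_pb_socket:update_type(PBClient,
             {<<"gsets">>, <<"bucket">>}, <<"key">>,
             riakc_set:to_op(Set1)).
```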

It is not obvious why this update should time out, given that other updates to other objects have not timed out. It is consistently this update which times out, though (still intermittently - sometimes this update returns ok and the test goes on to pass).

The positive here is that the update itself failed, so this does not indicate a fundamental problem with the data type, i.e. it was not failing to merge updates which had succeeded.


martinsumner commented Dec 9, 2024

Pausing for 1s after partitioning the cluster, before starting the updates, stops the timeout from occurring and leads to the test passing more consistently.
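The workaround would look roughly like this in the test (a sketch only - the partition arguments and surrounding code are assumptions, not copied from the test source):

```erlang
%% Hypothetical sketch: partition the cluster, then pause for 1s
%% before issuing the first datatype updates, so the PB update does
%% not hit the not-yet-ready coordinator path.
Partition = rt:partition(P1Nodes, P2Nodes),
timer:sleep(1000),
%% ... proceed with update_1a / update_2a as before ...
```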

Note that the HTTP client can also fail, but when the HTTP client fails the test fails immediately, as the HTTP client does not handle the request timeout like the PB client does - it crashes instead:

```
verify_dt_converge failed: {function_clause,[{rhc_dt,decode_error,[update,req_timedout] ...
```
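A possible fix for that crash (a sketch, not a tested patch) would be an extra clause in `rhc_dt:decode_error/2` matching the raw timeout atom, so the HTTP client returns an error tuple like the PB client instead of crashing with `function_clause`:

```erlang
%% Sketch of an additional clause for rhc_dt:decode_error/2, placed
%% before the existing clauses, to map the HTTP request timeout onto
%% the same {error, timeout} shape the PB client returns.
decode_error(_Op, req_timedout) ->
    {error, timeout};
```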


martinsumner commented Dec 9, 2024

This appears to be an issue with the test harness not waiting for the cluster to be stable after it is built. By default the rt:build_cluster function doesn't wait for transfers to complete.

In the case of this test, the vnode (to be used as the PUT coordinator) is initiated almost immediately before the PUT_FSM starts to use it. It appears that if this is > 5ms before, the test works, and if it is < 5ms before, the test fails due to the update timeout - where the update times out because a coordinator is selected but the message is never received by the coordinator.

This is potentially a bug - a vnode can appear up and be selected as a coordinator, but something in the path for routing a request to that vnode isn't ready. Hence the PUT_FSM sends the local request to coordinate the PUT, and the vnode never receives it.

Adding rt:wait_until_transfers_complete(Nodes) after the cluster build appears to make the test pass consistently.
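The fix in the test setup is then a one-line addition after the build (node count here is illustrative, not taken from the test):

```erlang
%% Build the cluster, then wait for ownership transfers to complete
%% before the test starts issuing updates, so no vnode is selected as
%% a PUT coordinator before it is ready to receive requests.
Nodes = rt:build_cluster(4),
ok = rt:wait_until_transfers_complete(Nodes),
```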


#31
