
verify_dt_converge #30

Open
martinsumner opened this issue Dec 9, 2024 · 4 comments


martinsumner commented Dec 9, 2024

The test fails intermittently, although it appears to fail consistently when testing gset (and not other datatypes).

The test most often fails at check 3a, with only one of the expected values being returned:

```
Expected [<<"DANG">>,<<"Z">>,<<"Z^2">>], got {ok,{gset,[<<"Z">>],[],undefined}}
```

So the values `[<<"DANG">>,<<"Z^2">>]` do not appear to have been merged, i.e. the update in 2a on this side of the partition does not happen.

Note that the test doesn't assert that a check is ok until the final check. So the test may appear to fail on the final check, but it has already "failed" on a previous check; it is just that the `wait_until` timed out and returned an error which was ignored.
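To illustrate the pattern described above, a minimal sketch (function and variable names are assumed, not taken from the actual test source): each intermediate check polls under `rt:wait_until/1`, and a timeout simply falls through rather than failing the test there and then.

```erlang
%% Hypothetical sketch of the check pattern: rt:wait_until/1 polls the
%% fun until it returns true or a timeout elapses. If the check never
%% succeeds, wait_until returns an error tuple which the test ignores
%% for intermediate checks - only the final check asserts.
check_value(Fetch, Expected) ->
    rt:wait_until(fun() ->
        case Fetch() of
            {ok, Expected} ->
                true;
            Other ->
                lager:info("Expected ~p, got ~p", [Expected, Other]),
                false
        end
    end).
```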


martinsumner commented Dec 9, 2024

The issue with update_2a appears to be with the PB change to the GSET (there is a change to an object using the PB client, then a change to another object using the HTTP client). The PB change returns `{error, timeout}`, which is why check 2a fails.
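For reference, the failing PB update would have roughly this shape (bucket, key and client names are assumptions for illustration, and the set module may differ for the gset type in the client version used by the test):

```erlang
%% Hypothetical shape of the gset update via the Erlang PB client.
%% This call is the one observed to return {error, timeout}
%% intermittently after the partition is created.
Set0 = riakc_set:new(),
Set1 = riakc_set:add_element(<<"DANG">>, Set0),
Result = riakc_pb_socket:update_type(PBClient,
             {<<"gsets">>, <<"bucket">>}, <<"key">>,
             riakc_set:to_op(Set1)).
```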

It is not obvious why this update should time out, given that other updates to other objects have not timed out. It is consistently this update which times out, though (still intermittently - sometimes this update returns ok and the test goes on to pass).

The positive here is that the update itself failed, so this does not indicate a fundamental problem with the data type, i.e. it was not failing to merge updates which had succeeded.


martinsumner commented Dec 9, 2024

Pausing for 1s after partitioning the cluster, before starting the updates, stops the timeout from occurring and leads to the test passing more consistently.
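The workaround would look roughly like this in the test (a sketch only - the partition arguments and surrounding code are assumptions, not copied from the test source):

```erlang
%% Hypothetical sketch: partition the cluster, then pause for 1s
%% before issuing the first datatype updates, so the PB update does
%% not hit the not-yet-ready coordinator path.
Partition = rt:partition(P1Nodes, P2Nodes),
timer:sleep(1000),
%% ... proceed with update_1a / update_2a as before ...
```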

Note that the HTTP client can also fail, but when the HTTP client fails the test fails immediately, as the HTTP client does not handle the request timeout like the PB client does - it crashes instead:

```
verify_dt_converge failed: {function_clause,[{rhc_dt,decode_error,[update,req_timedout] ...
```
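A possible fix for that crash (a sketch, not a tested patch) would be an extra clause in `rhc_dt:decode_error/2` matching the raw timeout atom, so the HTTP client returns an error tuple like the PB client instead of crashing with `function_clause`:

```erlang
%% Sketch of an additional clause for rhc_dt:decode_error/2, placed
%% before the existing clauses, to map the HTTP request timeout onto
%% the same {error, timeout} shape the PB client returns.
decode_error(_Op, req_timedout) ->
    {error, timeout};
```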


martinsumner commented Dec 9, 2024

This appears to be an issue with the test harness not waiting for the cluster to be stable after it is built. By default the rt:build_cluster function doesn't wait for transfers to complete.

In the case of this test, the vnode (to be used as the PUT coordinator) is initiated almost immediately before the PUT_FSM starts to use it. It appears that if this is > 5ms before, the test works, and if it is < 5ms before, the test fails due to the update timeout - where the update times out because a coordinator is selected but the message is never received by the coordinator.

This is potentially a bug - a vnode can appear up and be selected as a coordinator, but something in the path for routing a request to that vnode isn't ready. Hence the PUT_FSM sends the local request to coordinate the PUT, and the vnode never receives it.

Adding rt:wait_until_transfers_complete(Nodes) after the cluster build appears to make the test pass consistently.
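The fix in the test setup is then a one-line addition after the build (node count here is illustrative, not taken from the test):

```erlang
%% Build the cluster, then wait for ownership transfers to complete
%% before the test starts issuing updates, so no vnode is selected as
%% a PUT coordinator before it is ready to receive requests.
Nodes = rt:build_cluster(4),
ok = rt:wait_until_transfers_complete(Nodes),
```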


#31
