Correct way to rejoin a failed node? #420

RoadRunnr · 2024-02-12T11:56:44Z


What is the correct way to rejoin a failed node when its WAL directory was lost (e.g. the WAL directory resided on a tmpfs of a container)?

Consider the following:

nodes are brought up like this:

rebar3 shell --start-clean --name ra1@hostname.local
rebar3 shell --start-clean --name ra2@hostname.local
rebar3 shell --start-clean --name ra3@hostname.local

and then on ra1:

Erlang/OTP 26 [erts-14.2] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [jit:ns]

Eshell V14.2 (press Ctrl+G to abort, type help(). for help)
(ra1@hostname.local)1> ErlangNodes = ['ra1@hostname.local', 'ra2@hostname.local', 'ra3@hostname.local'],
                       erpc:multicall(ErlangNodes, ra, start, []),
                       ServerIds = [{quick_start, N} || N <- ErlangNodes],
                       ClusterName = quick_start,
                       Machine = {simple, fun erlang:'+'/2, 0},
                       {ok, ServersStarted, _ServersNotStarted} = ra:start_cluster(default, ClusterName, Machine, ServerIds),
                       {ok, StateMachineResult, LeaderId} = ra:process_command(hd(ServersStarted), 5),
                       {ok, 12, LeaderId1} = ra:process_command(LeaderId, 7).
{ok,12,{quick_start,'[email protected]'}}
(ra1@hostname.local)2> ra:members({local, quick_start}).
{ok,[{quick_start,'ra1@hostname.local'},
     {quick_start,'ra2@hostname.local'},
     {quick_start,'ra3@hostname.local'}],
    {quick_start,'[email protected]'}}
(ra1@hostname.local)3>

ra2 is killed uncleanly and its ra state is lost:

$ rebar3 shell --start-clean --name ra2@hostname.local
===> Verifying dependencies...
===> Analyzing applications...
===> Compiling ra
Erlang/OTP 26 [erts-14.2] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [jit:ns]

Eshell V14.2 (press Ctrl+G to abort, type help(). for help)
(ra2@hostname.local)1> 
$ rm -rf ra2@hostname.local/
$ rebar3 shell --start-clean --name ra2@hostname.local
===> Verifying dependencies...
===> Analyzing applications...
===> Compiling ra
Erlang/OTP 26 [erts-14.2] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [jit:ns]

Eshell V14.2 (press Ctrl+G to abort, type help(). for help)
(ra2@hostname.local)1>

Question: what is the correct way to rejoin ra2 to the cluster?

Doing a simple add doesn't work:

Erlang/OTP 26 [erts-14.2] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [jit:ns]

Eshell V14.2 (press Ctrl+G to abort, type help(). for help)
(ra2@hostname.local)1> ra:start(),
                       ServerId = {quick_start, node()},
                       ClusterName = quick_start,
                       Machine = {simple, fun erlang:'+'/2, 0}.
{simple,fun erlang:'+'/2,0}
(ra2@hostname.local)2> ra:start_server(default, ClusterName, ServerId, Machine, []),
                       ra:add_member({quick_start, '[email protected]'}, ServerId).
{error,already_member}

Passing an existing server to start_server does not change the outcome:

Erlang/OTP 26 [erts-14.2] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [jit:ns]

Eshell V14.2 (press Ctrl+G to abort, type help(). for help)
(ra2@hostname.local)1> ra:start(),
                       ServerId = {quick_start, node()},
                       ClusterName = quick_start,
                       Machine = {simple, fun erlang:'+'/2, 0}.
{simple,fun erlang:'+'/2,0}
(ra2@hostname.local)2> ra:start_server(default, ClusterName, ServerId, Machine, [{ClusterName, '[email protected]'}]).
ok
(ra2@hostname.local)3> ra:add_member({quick_start, '[email protected]'}, ServerId).
{error,already_member}

What works is to first remove the old server entry and then do the add:

Erlang/OTP 26 [erts-14.2] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [jit:ns]

Eshell V14.2 (press Ctrl+G to abort, type help(). for help)
(ra2@hostname.local)1> ra:start(),
                       ServerId = {quick_start, node()},
                       ClusterName = quick_start,
                       Machine = {simple, fun erlang:'+'/2, 0}.
{simple,fun erlang:'+'/2,0}
(ra2@hostname.local)2> ra:start_server(default, ClusterName, ServerId, Machine, []),
                       ra:remove_member({quick_start, '[email protected]'}, ServerId).
{ok,{6,1},{quick_start,'[email protected]'}}
(ra2@hostname.local)3> timer:sleep(1000).
ok
(ra2@hostname.local)4> ra:add_member({quick_start, '[email protected]'}, ServerId).
{ok,{7,1},{quick_start,'[email protected]'}}

My question now is: is doing the remove_member the correct and expected way to handle this type of crash recovery?

And something related: a ra server could have a large number of log entries when this happens. Adding a new server, or re-adding a lost server as above, will cause all log entries to be replayed on the newly added server. How can that be avoided? Or, put another way: what is the correct way of bringing up a server with a well-known state and point in the log (essentially, how to copy the state of another node to a newly added node)?

Answered by kjnilsson


kjnilsson · 2024-02-12T12:44:17Z

Maintainer

If your persisted state is lost or otherwise corrupted, ra2 cannot be safely used in the cluster, and you need to remove and re-add it, as you discovered.

There are no shortcuts to bypass the log replication for a new member, I'm afraid.
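For anyone scripting this recovery, the remove-then-re-add sequence from the question could be wrapped roughly as follows. This is only a sketch under stated assumptions: the module `ra_rejoin` and its functions are hypothetical helpers, not part of ra; the lost server is assumed to have been restarted already with `ra:start_server/5`, as in the transcript; and `{error, cluster_change_not_permitted}` is assumed to be what ra returns while an earlier membership change is still in flight, so the helper retries on it instead of relying on a fixed `timer:sleep(1000)`.

```erlang
-module(ra_rejoin).
-export([rejoin/2, rejoin/4]).

%% Hypothetical helper, not part of the ra API.
%% Member is any healthy member of the cluster; ServerId is the server
%% whose persisted state was lost.
rejoin(Member, ServerId) ->
    rejoin(fun ra:remove_member/2, fun ra:add_member/2, Member, ServerId).

%% Same as rejoin/2, but with the remove/add functions passed in so the
%% sequencing can be exercised without a running cluster.
rejoin(Remove, Add, Member, ServerId) ->
    case change(Remove, Member, ServerId, 10) of
        {ok, _IdxTerm, Leader} ->
            %% The stale entry is gone: add the server back via the leader.
            change(Add, Leader, ServerId, 10);
        {error, not_member} ->
            %% Already removed earlier; only the add is needed.
            change(Add, Member, ServerId, 10);
        Other ->
            %% Timeouts and other errors are returned to the caller as-is.
            Other
    end.

%% Retry while the cluster refuses a membership change (assumption: this is
%% signalled by {error, cluster_change_not_permitted}).
change(_Fun, _Member, _ServerId, 0) ->
    {error, retries_exhausted};
change(Fun, Member, ServerId, Retries) ->
    case Fun(Member, ServerId) of
        {error, cluster_change_not_permitted} ->
            timer:sleep(200),
            change(Fun, Member, ServerId, Retries - 1);
        Result ->
            Result
    end.
```

On the restarted node this would be called as `ra_rejoin:rejoin(Member, {quick_start, node()})`, where `Member` is the server id of any node that still has its state. The re-added server still receives its state through normal log replication, as noted above.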
