What is the correct way to rejoin a failed node when its WAL directory was lost (e.g. the WAL directory resided on a tmpfs of a container)? Consider the following:
and then on ra1:
Erlang/OTP 26 [erts-14.2] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [jit:ns]
Eshell V14.2 (press Ctrl+G to abort, type help(). for help)
(ra1@hostname.local)1> ErlangNodes = ['[email protected]', '[email protected]', '[email protected]'],
erpc:multicall(ErlangNodes, ra, start, []),
ServerIds = [{quick_start, N} || N <- ErlangNodes],
ClusterName = quick_start,
Machine = {simple, fun erlang:'+'/2, 0},
{ok, ServersStarted, _ServersNotStarted} = ra:start_cluster(default, ClusterName, Machine, ServerIds),
{ok, StateMachineResult, LeaderId} = ra:process_command(hd(ServersStarted), 5),
{ok, 12, LeaderId1} = ra:process_command(LeaderId, 7).
{ok,12,{quick_start,'[email protected]'}}
(ra1@hostname.local)2> ra:members({local, quick_start}).
{ok,[{quick_start,'[email protected]'},
{quick_start,'[email protected]'},
{quick_start,'[email protected]'}],
{quick_start,'[email protected]'}}
(ra1@hostname.local)3>
$ rebar3 shell --start-clean --name ra2@hostname.local
===> Verifying dependencies...
===> Analyzing applications...
===> Compiling ra
Erlang/OTP 26 [erts-14.2] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [jit:ns]
Eshell V14.2 (press Ctrl+G to abort, type help(). for help)
(ra2@hostname.local)1>
$ rm -rf ra2@hostname.local/
$ rebar3 shell --start-clean --name ra2@hostname.local
===> Verifying dependencies...
===> Analyzing applications...
===> Compiling ra
Erlang/OTP 26 [erts-14.2] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [jit:ns]
Eshell V14.2 (press Ctrl+G to abort, type help(). for help)
(ra2@hostname.local)1>
Question: what is the correct way to rejoin?
Doing a simple add doesn't work:
Erlang/OTP 26 [erts-14.2] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [jit:ns]
Eshell V14.2 (press Ctrl+G to abort, type help(). for help)
(ra2@hostname.local)1> ra:start(),
ServerId = {quick_start, node()},
ClusterName = quick_start,
Machine = {simple, fun erlang:'+'/2, 0}.
{simple,fun erlang:'+'/2,0}
(ra2@hostname.local)2> ra:start_server(default, ClusterName, ServerId, Machine, []),
ra:add_member({quick_start, '[email protected]'}, ServerId).
{error,already_member}
Passing an existing server to start_server does not change the outcome:
Erlang/OTP 26 [erts-14.2] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [jit:ns]
Eshell V14.2 (press Ctrl+G to abort, type help(). for help)
(ra2@hostname.local)1> ra:start(),
ServerId = {quick_start, node()},
ClusterName = quick_start,
Machine = {simple, fun erlang:'+'/2, 0}.
{simple,fun erlang:'+'/2,0}
(ra2@hostname.local)2> ra:start_server(default, ClusterName, ServerId, Machine, [{ClusterName, '[email protected]'}]).
ok
(ra2@hostname.local)3> ra:add_member({quick_start, '[email protected]'}, ServerId).
{error,already_member}
What works is to first remove the old server entry and then do the add:
Erlang/OTP 26 [erts-14.2] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [jit:ns]
Eshell V14.2 (press Ctrl+G to abort, type help(). for help)
(ra2@hostname.local)1> ra:start(),
ServerId = {quick_start, node()},
ClusterName = quick_start,
Machine = {simple, fun erlang:'+'/2, 0}.
{simple,fun erlang:'+'/2,0}
(ra2@hostname.local)2> ra:start_server(default, ClusterName, ServerId, Machine, []),
ra:remove_member({quick_start, '[email protected]'}, ServerId).
{ok,{6,1},{quick_start,'[email protected]'}}
(ra2@hostname.local)3> timer:sleep(1000).
ok
(ra2@hostname.local)4> ra:add_member({quick_start, '[email protected]'}, ServerId).
{ok,{7,1},{quick_start,'[email protected]'}}
My question now is: is doing the remove followed by the add the correct way to rejoin?
And something related: a Ra server could have a large number of log entries when this happens. Adding a new server, or re-adding a lost server as above, will cause all log entries to be replayed on the newly added server. How can that be avoided? In other words: what is the correct way of bringing up a server with a well-known state and point in the log (essentially, how to copy the state of another node to a newly added node)?
Replies: 1 comment
If your persisted state is lost or otherwise corrupted, ra2 cannot safely be used in the cluster, and you need to remove and re-add it, as you discovered. There are no shortcuts to bypass the log replication for a new member, I'm afraid.
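For reference, the remove-then-add sequence from the question could be wrapped roughly like this. This is only a sketch that mirrors the shell session: the module name `ra_rejoin_sketch`, the function `rejoin/5`, and the `Peer` argument (any surviving member of the cluster) are made up for illustration, and the one-second pause is simply copied from the `timer:sleep(1000)` in the session rather than being a documented requirement. It assumes `ra:start()` has already been called on the node.

```erlang
-module(ra_rejoin_sketch).
-export([rejoin/5]).

%% Hypothetical helper, not part of Ra.
%% System      - Ra system name, e.g. default
%% ClusterName - e.g. quick_start
%% ServerId    - the server that lost its state, e.g. {quick_start, node()}
%% Machine     - the machine config the cluster was started with
%% Peer        - a surviving member that can reach the leader
rejoin(System, ClusterName, ServerId, Machine, Peer) ->
    %% Start the local server with no initial members, as in the session.
    case ra:start_server(System, ClusterName, ServerId, Machine, []) of
        ok ->
            remove_then_add(Peer, ServerId);
        {error, _} = StartError ->
            StartError
    end.

remove_then_add(Peer, ServerId) ->
    %% Drop the stale membership entry first; adding directly
    %% returned {error, already_member} in the session.
    case ra:remove_member(Peer, ServerId) of
        {ok, _IdxTerm, _Leader} ->
            %% Pause copied from the session (assumption: gives the
            %% removal time to take effect before the next change).
            timer:sleep(1000),
            ra:add_member(Peer, ServerId);
        Other ->
            %% Pass any non-ok result back to the caller unchanged.
            Other
    end.
```

It would be called on the recovered node as `ra_rejoin_sketch:rejoin(default, quick_start, {quick_start, node()}, {simple, fun erlang:'+'/2, 0}, Peer)`. The re-added server still receives the log through normal replication, so this does not avoid the replay.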