This repository has been archived by the owner on Jan 26, 2023. It is now read-only.

During a jobspec change to the redis-server task, the cluster did not automatically rebuild #34

Open
jcjones opened this issue Aug 13, 2022 · 3 comments


jcjones commented Aug 13, 2022

I changed the name of the redis-server task (from server to redis-server) and redeployed the jobspec. After each allocation restarted with the new task name, it failed to rejoin the cluster until it was restarted a second time (via the Nomad GUI).

Attache-control logs:

time="2022-08-13T00:20:54Z" level=info msg="starting /usr/local/bin/attache-control"
time="2022-08-13T00:20:54Z" level=info msg="initializing a new redis client"
time="2022-08-13T00:20:54Z" level=info msg="initializing a new consul client"
time="2022-08-13T00:20:54Z" level=info msg="fetching scaling options from consul path 'service/redis-cluster/scaling'"
time="2022-08-13T00:20:57Z" level=info msg="this node is already part of an existing cluster"
time="2022-08-13T00:20:57Z" level=info msg="running until killed..."

Redis, however, is not part of a cluster:

10.0.32.81:20001> cluster nodes
0bd16fb965741d36e64304458b4f0264c248d25e 10.0.32.81:20001@30001 myself,master - 0 0 0 connected
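
To make the mismatch concrete, here is a minimal Go sketch of how the `cluster nodes` output above could be checked mechanically. It assumes the standard CLUSTER NODES line layout (eight fixed fields, followed by any slot ranges). `looksStandalone` is a hypothetical helper written for this illustration; it is not a function in attache, and it says nothing about how attache-control actually decides that a node is "already part of an existing cluster".

```go
package main

import (
	"fmt"
	"strings"
)

// looksStandalone reports whether CLUSTER NODES output describes a node that
// lists only itself and owns no hash slots.
// Hypothetical helper for illustration; not part of attache.
func looksStandalone(clusterNodes string) bool {
	sawMyself := false
	peers := 0
	ownsSlots := false
	for _, line := range strings.Split(clusterNodes, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 8 {
			continue // blank or malformed line
		}
		// The third field is a comma-separated flag list, e.g. "myself,master".
		isMyself := false
		for _, flag := range strings.Split(fields[2], ",") {
			if flag == "myself" {
				isMyself = true
			}
		}
		if !isMyself {
			peers++
			continue
		}
		sawMyself = true
		// Anything after the eight fixed fields is a slot assignment.
		if len(fields) > 8 {
			ownsSlots = true
		}
	}
	return sawMyself && peers == 0 && !ownsSlots
}

func main() {
	// The single line reported in this issue.
	out := "0bd16fb965741d36e64304458b4f0264c248d25e 10.0.32.81:20001@30001 myself,master - 0 0 0 connected"
	fmt.Println(looksStandalone(out)) // prints "true"
}
```

A check along these lines would flag the state shown here: the node knows no peers and holds no slots, even though the attache-control log says it is already in a cluster.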

The Redis log is nearly empty, with no mention of being told to join a cluster:

1:C 13 Aug 2022 00:20:46.789 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 13 Aug 2022 00:20:46.789 # Redis version=6.2.7, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 13 Aug 2022 00:20:46.789 # Configuration loaded
1:M 13 Aug 2022 00:20:46.796 # A key '__redis__compare_helper' was added to Lua globals which is not on the globals allow list nor listed on the deny list.
1:M 13 Aug 2022 00:20:46.796 # Server initialized
1:M 13 Aug 2022 00:20:46.796 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
1:M 13 Aug 2022 00:20:46.847 # IP address for this node updated to 10.0.32.81

An operator can fix this trivially by restarting the alloc again.


jcjones commented Aug 13, 2022

Restarting just attache-control is not sufficient.


jcjones commented Aug 15, 2022

This appears to have been caused by the supplied Consul server being blocked by a firewall; it took many minutes for the connection failure to be logged.

I think the handling of that error case can still be improved, but I don't know exactly how yet.


jcjones commented Aug 15, 2022

One significant improvement would be to make it clearer when Consul is timing out. I think that might be all that is needed for this issue, and we might do other things to improve the deployment model.
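
One way this could look, as a rough sketch rather than a proposal for the actual attache code: probe the Consul address with a short TCP dial before doing anything else, and say explicitly in the error whether the attempt timed out. It uses only the Go standard library. `probeConsul` and `defaultProbeTimeout` are hypothetical names, the two-second timeout is arbitrary, and `127.0.0.1:8500` is just Consul's default HTTP API address standing in for the real one.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// defaultProbeTimeout is an arbitrary illustrative value, not an attache setting.
const defaultProbeTimeout = 2 * time.Second

// probeConsul attempts a plain TCP connection to addr ("host:port") and
// returns an error that states whether the attempt timed out.
// Hypothetical helper for illustration; not part of attache.
func probeConsul(addr string, timeout time.Duration) error {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		var netErr net.Error
		if errors.As(err, &netErr) && netErr.Timeout() {
			// Silently dropped packets (e.g. at a firewall) typically end up here.
			return fmt.Errorf("consul at %s timed out after %s: %w", addr, timeout, err)
		}
		return fmt.Errorf("consul at %s unreachable: %w", addr, err)
	}
	return conn.Close()
}

func main() {
	// Assumed address; substitute the Consul server the job is configured with.
	if err := probeConsul("127.0.0.1:8500", defaultProbeTimeout); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("consul reachable")
}
```

With a bounded dial like this, a firewalled Consul server would surface as a timeout message within seconds instead of after many minutes.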
