Bricks are failing to connect to the volume post gluster node reboot #1457
Comments
@atinmu This is due to a delay in brick sign-in, I believe. @PrasadDesala Can you give the bricks some more time and check after a while whether they still show port 0?
@vpandey-RH It's been more than 45 minutes, and I still see the bricks trying to reconnect.
Is there any change in the number of bricks that were previously showing port 0?
@PrasadDesala Seems like there is no glusterfsd running on the node that was rebooted. Can you check it once?
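For anyone following along, one way to run that check without first attaching a shell to the pod is via kubectl. This is only a sketch: the "gcs" namespace is an assumption, and the pod name is taken from the report below.

# Sketch: confirm whether any glusterfsd brick processes exist in the rebooted pod
# ("gcs" namespace is an assumption; pod name comes from the report).
kubectl -n gcs exec gluster-kube1-0 -- ps -ef | grep -i glusterfsd
# List the per-brick unix sockets glusterd2 created on this node.
kubectl -n gcs exec gluster-kube1-0 -- ls -l /var/run/glusterd2/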
Yes, it seems the brick process is not running after the gluster node reboot, so the port is showing as '0' for the bricks on that node. Below is a snippet of the volume status output for one volume after the node reboot:
Taking this out of the GCS/1.0 tag, considering we're not going to make brick multiplexing a default option in the GCS/1.0 release.
Bricks are failing to connect to the volume post gluster node reboot.
Observed behavior
On a system with 102 PVCs and brick-mux enabled, I rebooted the gluster-kube1-0 pod. After some time the gluster pod came back online and is connected to the trusted pool, but the bricks on that gluster node are failing to connect to the volume.
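For reference, the reboot step can presumably be reproduced along these lines (a sketch only; the "gcs" namespace, the use of kubectl delete to restart the StatefulSet pod, and the wait timeout are assumptions on my part), after which the checks shown below are re-run from inside the pod:

# Sketch: restart the gluster pod to simulate the node reboot (namespace is an assumption).
kubectl -n gcs delete pod gluster-kube1-0
# Wait for the StatefulSet to recreate the pod, then attach and re-run the checks below.
kubectl -n gcs wait --for=condition=Ready pod/gluster-kube1-0 --timeout=10m
kubectl -n gcs exec -it gluster-kube1-0 -- /bin/bash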
[root@gluster-kube1-0 /]# ps -ef | grep -i glusterfsd
root 30332 59 0 09:52 pts/3 00:00:00 grep --color=auto -i glusterfsd
[root@gluster-kube1-0 /]# glustercli volume status pvc-db2b6e88-0f29-11e9-aaf6-525400933534
Volume : pvc-db2b6e88-0f29-11e9-aaf6-525400933534
+--------------------------------------+-------------------------------+-----------------------------------------------------------------------------------------+--------+-------+------+
| BRICK ID | HOST | PATH | ONLINE | PORT | PID |
+--------------------------------------+-------------------------------+-----------------------------------------------------------------------------------------+--------+-------+------+
| 129ac9de-9e60-4227-99df-48d7e17238f9 | gluster-kube3-0.glusterd2.gcs | /var/run/glusterd2/bricks/pvc-db2b6e88-0f29-11e9-aaf6-525400933534/subvol1/brick1/brick | true | 35692 | 4034 |
| 46a34351-19a2-4fd2-b692-ea07fbe4f71d | gluster-kube1-0.glusterd2.gcs | /var/run/glusterd2/bricks/pvc-db2b6e88-0f29-11e9-aaf6-525400933534/subvol1/brick2/brick | false | 0 | 0 |
| 0935a101-2e0d-4c5f-914f-0e4562602950 | gluster-kube2-0.glusterd2.gcs | /var/run/glusterd2/bricks/pvc-db2b6e88-0f29-11e9-aaf6-525400933534/subvol1/brick3/brick | true | 39067 | 4115 |
+--------------------------------------+-------------------------------+-----------------------------------------------------------------------------------------+--------+-------+------+
I am continuously seeing the messages below in the glusterd2 logs:
time="2019-01-03 09:52:57.982317" level=error msg="failed to connect to brick, aborting volume profile operation" brick="6257213e-de5c-4ae5-867d-38e0fd5abc0e:/var/run/glusterd2/bricks/pvc-81d554b4-0f27-11e9-aaf6-525400933534/subvol1/brick1/brick" error="dial unix /var/run/glusterd2/e70300fdb0bea4a4.socket: connect: connection refused" reqid=63bce8cc-c403-4978-8137-bb3ae361b496 source="[volume-profile.go:246:volumes.txnVolumeProfile]" txnid=e763af77-19f2-4935-bd02-9c65be68657a
time="2019-01-03 09:52:57.982371" level=error msg="Step failed on node." error="dial unix /var/run/glusterd2/e70300fdb0bea4a4.socket: connect: connection refused" node=6257213e-de5c-4ae5-867d-38e0fd5abc0e reqid=63bce8cc-c403-4978-8137-bb3ae361b496 source="[step.go:120:transaction.runStepFuncOnNodes]" step=volume.Profile txnid=e763af77-19f2-4935-bd02-9c65be68657a
time="2019-01-03 09:52:57.997172" level=info msg="client connected" address="10.233.64.5:48521" server=sunrpc source="[server.go:148:sunrpc.(*SunRPC).acceptLoop]" transport=tcp
time="2019-01-03 09:52:57.998020" level=error msg="registry.SearchByBrickPath() failed for brick" brick=/var/run/glusterd2/bricks/pvc-82196ac3-0f27-11e9-aaf6-525400933534/subvol1/brick1/brick error="SearchByBrickPath: port for brick /var/run/glusterd2/bricks/pvc-82196ac3-0f27-11e9-aaf6-525400933534/subvol1/brick1/brick not found" source="[rpc_prog.go:104:pmap.(*GfPortmap).PortByBrick]"
time="2019-01-03 09:52:57.998383" level=info msg="client disconnected" address="10.233.64.5:48521" server=sunrpc source="[server.go:109:sunrpc.(*SunRPC).pruneConn]"
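The "connect: connection refused" in the dial error above is consistent with nothing listening on the brick's unix socket, which matches the missing glusterfsd process. A quick confirmation from inside the rebooted pod might look like this (a sketch; the socket path is copied from the log line above):

# Sketch: the socket file likely still exists (hence "connection refused" rather than
# "no such file or directory"), but nothing should be listening on it.
ls -l /var/run/glusterd2/e70300fdb0bea4a4.socket
# List listening unix-domain sockets; with the brick down, no entry for this socket is expected.
ss -xlp | grep glusterd2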
Expected/desired behavior
Post gluster pod reboot, bricks should connect back to the volume without any issues.
Details on how to reproduce (minimal and precise)
Information about the environment: