Communication with witnesses hangs if one witness is not responding #814

Open
rodolfomiranda opened this issue Jul 5, 2024 · 1 comment

rodolfomiranda commented Jul 5, 2024

Version

all

Environment

Witness Deployment

Expected behavior

When Receiptor sends events to the list of witnesses, it should try to communicate with all witnesses even if one of them is unresponsive.

Actual behavior

Current logic in Receiptor

for wit, client in clients.items():
    headers = dict()
    if wit in auths:
        headers["Authorization"] = auths[wit]
    httping.streamCESRRequests(client=client, dest=wit, ims=bytearray(msg), path="receipts", headers=headers)
    while not client.responses:
        yield self.tock
    rep = client.respond()
    if rep.status == 200:
        rct = bytearray(rep.body)
        hab.psr.parseOne(bytearray(rct))
        rserder = serdering.SerderKERI(raw=rct)
        del rct[:rserder.size]
        # pull off the count code
        coring.Counter(qb64b=rct, strip=True)
        rcts[wit] = rct
    else:
        print(f"invalid response {rep.status} from witnesses {wit}")

This code stays in an infinite loop at line 96 (the while not client.responses wait) if one witness is unresponsive, which prevents the loop over witnesses (line 90) from iterating over the rest.
Ideally the client should time out; instead it tries forever, at least when using the demo witnesses.
One option is to add a timeout to the wait at line 96.
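
A minimal sketch of the timeout idea, not a patch against keripy: the wait is bounded by a deadline, and the loop over witnesses moves on when a witness does not answer in time. The names wait_for_response, poll_witnesses, StubClient and run, and the timeout and clock parameters, are hypothetical; StubClient only models the responses attribute that the loop above checks.

```python
import time


class StubClient:
    """Hypothetical stand-in for the HTTP client; only models `responses`."""

    def __init__(self, responses=()):
        self.responses = list(responses)


def wait_for_response(client, tock=0.0, timeout=5.0, clock=time.monotonic):
    """Yield `tock` until a response arrives or `timeout` seconds elapse.

    The generator's return value is True if a response is available, else False.
    """
    deadline = clock() + timeout
    while not client.responses:
        if clock() >= deadline:
            return False  # give up on this witness instead of waiting forever
        yield tock
    return True


def poll_witnesses(clients, tock=0.0, timeout=5.0, clock=time.monotonic):
    """Visit every witness in order, skipping any that time out."""
    answered, timed_out = {}, []
    for wit, client in clients.items():
        ok = yield from wait_for_response(client, tock, timeout, clock)
        if ok:
            answered[wit] = client.responses.pop(0)
        else:
            timed_out.append(wit)
    return answered, timed_out


def run(gen):
    """Drive a generator to completion and return its return value."""
    try:
        while True:
            next(gen)
    except StopIteration as stop:
        return stop.value


if __name__ == "__main__":
    # The unresponsive witness is listed first, as in step 5 of the reproduction.
    clients = {"down": StubClient(), "up": StubClient(["receipt"])}
    answered, timed_out = run(poll_witnesses(clients, timeout=0.05))
    print(answered, timed_out)
```

With sequential waiting, each unresponsive witness still costs one full timeout before the next witness is tried, so the timeout value would need to be chosen with the number of witnesses in mind.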

Steps to reproduce

Create a single-sig AID with toad=1 and two witnesses, one of them disconnected.

Running kli incept with the --receipt-endpoint flag forces the use of the Receiptor. The problem can be reproduced with the following steps:

1- start the demo witnesses: kli witness demo
2- init and resolve the witness OOBIs, one from the demo witnesses and one in the cloud

kli init --name local  --nopasscode
kli oobi resolve --name local --oobi http://127.0.0.1:5642/oobi/BBilc4-L3tFUnfM_wJr4S4OJanAv_VmF_dJNN6vkf2Ha/controller --oobi-alias witdemo
kli oobi resolve --name local --oobi http://witness1.dev.provenant.net:5631/oobi/BCf29L_7oQtU8WUXEV2Bi5sf7WoxnGyX7sgJSym-p4Pp/controller --oobi-alias witcloud

3- test incepting with toad=1 and both witnesses up

kli incept --name local --alias aid1  -w BCf29L_7oQtU8WUXEV2Bi5sf7WoxnGyX7sgJSym-p4Pp -w BBilc4-L3tFUnfM_wJr4S4OJanAv_VmF_dJNN6vkf2Ha   --toad 1 --icount 1 --isith 1 --ncount 1 --nsith 1 --transferable --receipt-endpoint
kli status --name local --alias aid1
Alias: 	aid1
Identifier: EC_vFyYXNLRTlXiTU8ON5m6g4QF1xQxkuejpShFiZbRT
Seq No:	0

Witnesses:
Count:		2
Receipts:	2
Threshold:	1

Public Keys:	
	1. DBELZgD9ZEn9wRwJ3XRxyFo6mFMqAb4poQ1fEDznXNuH

4- stop demo witnesses (ctrl-c)
5- incept a new aid

kli incept --name local --alias aid2  -w BBilc4-L3tFUnfM_wJr4S4OJanAv_VmF_dJNN6vkf2Ha   -w BCf29L_7oQtU8WUXEV2Bi5sf7WoxnGyX7sgJSym-p4Pp --toad 1 --icount 1 --isith 1 --ncount 1 --nsith 1 --transferable --receipt-endpoint
The command hangs; you need to press ctrl-c to stop the loop.
kli status --name local --alias aid2
Alias: 	aid2
Identifier: EKWiV3XUA90-0UJLbRuqPNiomjbIu57imJqHhmJy1req
Seq No:	0

Witnesses:
Count:		2
Receipts:	0
Threshold:	1

Note that no receipts were received even though one witness is up and toad=1.
6- incept a new aid with the witnesses in a different order

kli incept --name local --alias aid3  -w BCf29L_7oQtU8WUXEV2Bi5sf7WoxnGyX7sgJSym-p4Pp -w BBilc4-L3tFUnfM_wJr4S4OJanAv_VmF_dJNN6vkf2Ha   --toad 1 --icount 1 --isith 1 --ncount 1 --nsith 1 --transferable --receipt-endpoint
The command hangs; you need to press ctrl-c to stop the loop.
kli status --name local --alias aid3
Alias: 	aid3
Identifier: EENZlZ9DNhNFlga_BXIOv2NQ5KVbnxzow4aNGTupizin
Seq No:	0

Witnesses:
Count:		2
Receipts:	1
Threshold:	1

Note that now 1 receipt is received because the witness that is up is first in the loop.
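
One way to remove the dependence on witness order, sketched here as an assumption rather than as the project's fix, is to poll all clients together under a single shared deadline instead of waiting on them one at a time. The names collect_receipts, StubClient and run, and the timeout and clock parameters, are hypothetical; the sketch assumes the requests have already been sent to every witness before polling starts.

```python
import time


class StubClient:
    """Hypothetical stand-in for the HTTP client; only models `responses`."""

    def __init__(self, responses=()):
        self.responses = list(responses)


def collect_receipts(clients, tock=0.0, timeout=5.0, clock=time.monotonic):
    """Poll all witness clients together until each answered or time ran out.

    Returns (as the generator's return value) a dict of witness id -> response
    for the witnesses that answered before the shared deadline.
    """
    pending = dict(clients)
    receipts = {}
    deadline = clock() + timeout
    while pending and clock() < deadline:
        # Check every still-pending witness on each pass, in any order.
        for wit, client in list(pending.items()):
            if client.responses:
                receipts[wit] = client.responses.pop(0)
                del pending[wit]
        if pending:
            yield tock  # hand control back before polling again
    return receipts


def run(gen):
    """Drive a generator to completion and return its return value."""
    try:
        while True:
            next(gen)
    except StopIteration as stop:
        return stop.value


if __name__ == "__main__":
    down_first = {"down": StubClient(), "up": StubClient(["receipt"])}
    up_first = {"up": StubClient(["receipt"]), "down": StubClient()}
    print(run(collect_receipts(down_first, timeout=0.05)))
    print(run(collect_receipts(up_first, timeout=0.05)))
```

The caller would then compare the number of collected receipts against toad to decide whether the threshold was met.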

rodolfomiranda added the bug and triage labels Jul 5, 2024
m00sey self-assigned this Jul 8, 2024

rodolfomiranda (Contributor, Author) commented:

We may also encounter the infinite loop in:

Receiptor
WitnessReceiptor
WitnessInquisitor
WitnessPublisher
TCPMessenger
TCPStreamMessenger
HTTPMessenger

One option may be to pass a timeout value to those classes so that the loops are ended after a while.
