`polykey agent stop` command not terminating properly #185

CryptoTotalWar · 2024-05-22T04:50:36Z

Describe the bug

Looking through the logs, it looks like the shutdown hook never triggered. The log cuts out mid-line, as if the process was terminated. That must have been when you had to kill it manually.

This might be a macOS-specific problem.

To Reproduce

Run the agent
Wait for a few minutes
Run polykey agent stop

Test the Linux version the same way we got the error on Mac to see whether this is OS-specific.

Expected behavior

We handle most signals to trigger stopping the agent. The only thing that should really kill it like this is a SIGKILL signal.
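A minimal sketch of the expected behaviour, not the actual Polykey exit handler. The names `installStopHandlers`, `SignalSource`, and `handledSignals` are hypothetical, and the exact set of signals handled is an assumption. SIGKILL is left out because it cannot be caught by user code.

```typescript
// Minimal shape of what we need from `process`, so the sketch can be driven by a fake.
interface SignalSource {
  on(signal: string, listener: () => void): unknown;
}

// Assumed set of catchable signals. SIGKILL is deliberately absent: it can never be handled.
const handledSignals = ['SIGINT', 'SIGTERM', 'SIGHUP', 'SIGQUIT'] as const;

function installStopHandlers(
  source: SignalSource,
  stop: (signal: string) => void,
): void {
  let stopping = false;
  for (const signal of handledSignals) {
    source.on(signal, () => {
      // Assumption: repeated signals are ignored while a stop is already in flight.
      if (stopping) return;
      stopping = true;
      stop(signal);
    });
  }
}

// In a real agent this would be wired up roughly as:
//   installStopHandlers(process, () => { void agent.stop(); });
// where `agent` stands in for whatever object owns the shutdown logic.
```

With this shape, the only way the process dies without `stop` running is an uncatchable signal, which matches the log cutting out mid-line.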

Screenshots

Platform

Device: Mac
OS: Sonoma 14 Terminal Version 2.14
Version: polykey-cli-0.3.1-darwin-universal

Additional context

Polykey logs can be found here: Polykey-CLI#183
Propagated problem of connections being leaked causing push error flow being discussed here: Polykey-CLI#198

Notify maintainers

@tegefaulkes

linear · 2024-05-22T04:50:39Z

ENG-322 Polykey agent stop command not terminating properly

tegefaulkes · 2024-06-17T04:34:53Z

I think I've seen this while testing recently as well. So it's a certain condition that can happen after running for a while that causes it. That makes it very hard to pin down.

tegefaulkes · 2024-06-20T04:24:00Z

I suspect that while pkAgent.stop() is being called and the NodeConnectionManager is stopping with force, we may still be handling incoming connections and RPC requests despite the stopping state.

So I'd check this first. Write a test that fires a lot of connections and RPC requests at a NodeConnectionManager, then trigger NodeConnectionManager.stop() with force set to true and to false, and see what happens.

Right now I'm basing this on the fact that I caught it in the act. I had an agent running with verbose logging for a while and triggered it to stop. It entered the stopping state, but it was still serving incoming connections and streams. If we enter a stopping state, forced or not, we should reject all new connections and streams that are attempted.
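The rule in the last sentence can be sketched as a small state gate. This is illustrative only: `ConnectionGate` and its methods are hypothetical names and do not reflect the real NodeConnectionManager API.

```typescript
type GateState = 'running' | 'stopping' | 'stopped';

class ConnectionGate {
  private state: GateState = 'running';
  private active = new Set<string>();

  // Called for every incoming (reverse) connection or stream.
  accept(id: string): boolean {
    // Once stopping has begun, nothing new is admitted, forced or not.
    if (this.state !== 'running') return false;
    this.active.add(id);
    return true;
  }

  // Called when an admitted connection or stream has finished.
  release(id: string): void {
    this.active.delete(id);
    if (this.state === 'stopping' && this.active.size === 0) {
      this.state = 'stopped';
    }
  }

  // Enter the stopping state; existing work drains, new work is refused.
  beginStop(): void {
    if (this.state !== 'running') return;
    this.state = this.active.size === 0 ? 'stopped' : 'stopping';
  }

  get currentState(): GateState {
    return this.state;
  }
}
```

Because `accept` refuses everything after `beginStop`, the active set can only shrink, so a steady flow of incoming RPC calls cannot keep the stop from completing.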

tegefaulkes · 2024-07-10T01:40:24Z

I tried approaching this from the outside in: I added a bunch of logging and debugging for testing and waited for the problem to happen so I could catch it in the act. This isn't working very well...

The running theory is that connection activity is causing a race condition that deadlocks the stopping procedure for networking. It's hard to say exactly what and where this is happening. So I have to work from the bottom up fixing up any potential problems that would cause it.

The usual suspects are the following libraries. They create macrotasks, and if any of them leak then we will fail to close the process after stopping. They could also deadlock when cleaning up.

js-quic, Holds a bunch of macrotasks and resources. If it fails to clean up properly then it will deadlock.
js-rpc, Similar but is a layer between transport and application code. Lifecycle is determined by the streams it consumes so it seems unlikely to deadlock from that. However it does maintain macrotasks in the form of timers.
js-mdns, Similar to js-quic in that it maintains a socket and timers.
js-ws, Also maintains a socket and timers.

On top of the libraries, we have the following domains that maintain connection lifecycles. These could potentially deadlock when cleaning up.

client
nodes

The main bit of evidence I have so far is...

When I caught the CLI failing to end when stopping I saw it still serving a bunch of agent-agent RPC calls.
It seems to happen more often with more nodes in the network. So more network traffic.
Seems to trigger more often when calling agent stop to trigger stopping.

So knowing this, the most likely suspect is that endless RPC calls are preventing the connection from stopping fully and deadlocking stop. To fix this, I need to add draining state handling to the client and nodes domains, along with adding it to each of the transport libraries we have. I can't guarantee that it will fully fix the issue, but at the very least it will cross that potential problem off as the cause. So, moving forward:

Look at the NodeConnectionManager domain and add logic that prevents accepting or creating new connections when stopping. Creating connections should already be prevented but we need to check reverse connections.
The NodeConnections need to reject new streams when in the stopping state. Creating new streams should already be prevented, but we need to prevent reverse streams.
Review client domain for any potential connection or stream leaks. Also add the same draining state logic that rejects new connections and streams.

From here we can work our way downwards as needed for the libraries.

js-quic needs draining state support. We need to reject new streams and connections, both forward and reverse when we're stopping.
js-ws needs the same treatment.
js-rpc doesn't create any connections but timeout timers might leak.
js-mdns I haven't worked on this directly, but its usage of macrotasks can hold the process open, so we need to investigate it.
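For the timer side of this (leaked timeout timers holding the process open), one option is to route every timer through a registry that can be flushed on stop. This is a sketch under assumptions: `TimerRegistry` is a hypothetical helper, not something that exists in js-rpc or js-mdns.

```typescript
type TimerHandle = ReturnType<typeof setTimeout>;

class TimerRegistry {
  private timers = new Set<TimerHandle>();

  // Schedule a timer and remember it until it fires or is cleared.
  set(callback: () => void, delayMs: number): TimerHandle {
    const timer: TimerHandle = setTimeout(() => {
      this.timers.delete(timer);
      callback();
    }, delayMs);
    this.timers.add(timer);
    return timer;
  }

  // Cancel a single timer, e.g. when its RPC call completes in time.
  clear(timer: TimerHandle): void {
    clearTimeout(timer);
    this.timers.delete(timer);
  }

  // Cancel everything still pending; returns how many timers were outstanding.
  clearAll(): number {
    const count = this.timers.size;
    this.timers.forEach((timer) => clearTimeout(timer));
    this.timers.clear();
    return count;
  }

  get pending(): number {
    return this.timers.size;
  }
}
```

Calling `clearAll()` at the end of a stop gives a count that doubles as a leak report: a non-zero number means some code path forgot to clean up its own timer.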

This is a fair amount of polish work that may not even solve the issue. But it's all stuff I've been meaning to do eventually. May as well get it done now.

CMCDragonkai · 2024-08-17T15:30:35Z

As per the comment #157 (comment), I'm not a fan of the Stopping Agent STDERR message, it seems unnecessary, and I don't see how it helps in debugging anything. If the prompt is not returned to the terminal, then stopping is broken. And in every case where the agent is running automatically it should be using --verbose anyway.

In the ideal case, the agent should stop perfectly, and there is no need for this 3rd case message. In the non-ideal case, you just have to add an extra concurrent timer function using js-timer that runs and reports a warning that the agent cannot stop - this should result in a trace of what exactly is holding it. In the case of nodejs, remember that as a JS runtime, there are only 2 things that can hold the process open - any open IO fds, or infinite loops. And if we are able to run the concurrent function, then there can be no infinite loops. Then the only thing to trace is open FDs.
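A rough sketch of that concurrent watchdog. It uses a plain `setTimeout` rather than js-timer, and the names `startStopWatchdog` and `summariseResources` are hypothetical. The resource lister is injected; on recent Node versions `process.getActiveResourcesInfo()` (documented as experimental) returns the kind of string list assumed here.

```typescript
// Count resources by type, e.g. ["Timeout", "TCPSocketWrap", "Timeout"]
// becomes { Timeout: 2, TCPSocketWrap: 1 }.
function summariseResources(names: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const name of names) {
    counts[name] = (counts[name] ?? 0) + 1;
  }
  return counts;
}

// Start when stopping begins; call the returned function once stop completes.
function startStopWatchdog(
  delayMs: number,
  listResources: () => string[], // e.g. () => process.getActiveResourcesInfo()
  warn: (message: string) => void,
): () => void {
  const timer = setTimeout(() => {
    const summary = JSON.stringify(summariseResources(listResources()));
    warn(`Agent has not stopped after ${delayMs}ms; active resources: ${summary}`);
  }, delayMs);
  // Unref so the watchdog itself never keeps the process alive (Node-only method).
  (timer as unknown as { unref?: () => void }).unref?.();
  return () => clearTimeout(timer);
}
```

If the watchdog fires at all, the event loop is still turning, so the summary of open handles is the place to look.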

Node as a runtime, does not want to stay running. In fact we previously had an issue MatrixAI/Polykey#307 where node would sometimes just stop running, and we had no idea why. That turned out to be due to "promise deadlocks", and in fact we had an issue to try and trace promise deadlocks live, using the new async hooks api https://nodejs.org/api/async_hooks.html, we ended up not using it, but it's an important API for any concurrent trace debugging MatrixAI/js-logger#15, which we eventually want to collect together into a diagnostics domain: MatrixAI/Polykey#635

Debugging the opposite is supposed to be MUCH easier, as you can see from this SO question: https://stackoverflow.com/questions/26057328/node-js-inspect-whats-left-in-the-event-loop-thats-preventing-the-script-from.

CMCDragonkai · 2024-08-17T15:32:47Z

So going forward:

Investigate open fds for causing the process to stay open. (REMEMBER THAT NODE DOES NOT ALLOW PROMISE DEADLOCKS TO KEEP THE PROCESS OPEN!) - @tegefaulkes so I don't think investigating any of the async operations makes sense in debugging this.
Get rid of that Stopping Agent message that was introduced from Add a shutting down message to polykey agent start command #157.
As part of the Polykey conceptual structure/planning we should be starting on Setting up diagnostics Domain for keeping track of some operational metrics Polykey#635 to deal with these situations - I think PK has reached a level of complexity requiring dedicated diagnostics.
The exit handler should produce an exit code of 0 when it gracefully handles signals like SIGINT and SIGTERM. It should not be 130 and 143 respectively!
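For reference, 130 and 143 come from the usual convention of 128 plus the signal number (SIGINT is 2, SIGTERM is 15). A small worked sketch of the intended mapping follows; `exitCodeAfterStop` is a hypothetical helper, and falling back to the conventional code when the stop was not clean is an assumption, not something decided in this thread.

```typescript
// POSIX signal numbers for the two signals discussed above.
const signalNumbers: Record<string, number> = { SIGINT: 2, SIGTERM: 15 };

// Conventional exit status for a process that died from a signal: 128 + number.
function defaultSignalExitCode(signal: string): number {
  const signalNumber = signalNumbers[signal];
  if (signalNumber === undefined) {
    throw new Error(`Unknown signal: ${signal}`);
  }
  return 128 + signalNumber;
}

// A signal that was handled with a clean stop should report success.
function exitCodeAfterStop(signal: string, stoppedCleanly: boolean): number {
  return stoppedCleanly ? 0 : defaultSignalExitCode(signal);
}
```

So `exitCodeAfterStop('SIGINT', true)` is 0, while the unhandled defaults are 128 + 2 = 130 and 128 + 15 = 143.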

CMCDragonkai · 2024-08-23T03:48:12Z

If a graceful exit is halted by leaked connections, then this is likely connected to #198, which appears to be related to a remotely leaked connection that ends up causing an ungraceful exit of the agent.

CMCDragonkai · 2024-08-27T07:55:16Z

Get rid of that Stopping Agent message that was introduced from Add a shutting down message to polykey agent start command #157.

This can just be replaced by raising the info level to warning level for PolykeyAgent:Stopping Agent and PolykeyAgent: Stopped Agent.

tegefaulkes · 2024-08-27T22:49:04Z

I've created an issue at #270 to track that.

aryanjassal · 2024-12-12T22:34:02Z

My agent has been running in the background for about two days. As I have been working on #832 (RPC cancellation), I haven't been using the agent to run any commands. And when I tried to stop the agent, it was able to stop fairly quickly (under 3 seconds).

This means that most likely the issue is coming from a command and not from any background tasks that are run without input. I guess we can run each command once and try to shut down the agent to see which command causes the leaks, but that will be very time consuming.

Adding proper cancellation to all RPC commands should help circumvent this issue to an extent.

CMCDragonkai · 2024-12-13T17:09:05Z

All you need to do is to track resource counts using resource counter and you'll see what's leaking instead of trying to blackbox this.
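It isn't stated here which resource counter is meant, so the following is only a hypothetical in-process sketch of the idea: count every acquire and release per resource kind, then look at what is still outstanding after stop.

```typescript
// Hypothetical helper; not an existing Polykey or Node API.
class ResourceCounter {
  private counts = new Map<string, number>();

  // Call when a resource (stream, connection, timer, ...) is created.
  acquire(kind: string): void {
    this.counts.set(kind, (this.counts.get(kind) ?? 0) + 1);
  }

  // Call when that resource is destroyed.
  release(kind: string): void {
    const next = (this.counts.get(kind) ?? 0) - 1;
    if (next < 0) {
      throw new Error(`Released more ${kind} resources than were acquired`);
    }
    if (next === 0) this.counts.delete(kind);
    else this.counts.set(kind, next);
  }

  // Anything still listed here after stop has finished is a leak candidate.
  outstanding(): Record<string, number> {
    const result: Record<string, number> = {};
    this.counts.forEach((count, kind) => {
      result[kind] = count;
    });
    return result;
  }
}
```

Dumping `outstanding()` after each command would narrow down which command leaks without having to watch the agent from the outside.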

aryanjassal · 2025-01-14T02:50:53Z

I was running a temporary local node on my machine for testing purposes, and I encountered this log message before the agent got stuck on shutting down. This is using the latest staging for Polykey CLI. Before I stopped the agent, I attempted a vaults clone operation which failed with a timeout. I'm not sure if the timeout failure is relevant, but could be useful.

The log messages show which task failed to stop, so this could be useful to help pinpoint the issue and finally resolve it.

^CWARN:polykey:Stopping Agent
WARN:polykey.PolykeyAgent:Failed to stop task (NodeManager.refreshBucketHandler) after 10000ms
WARN:polykey.PolykeyAgent:Failed to stop task (NodeManager.refreshBucketHandler) after 10000ms
WARN:polykey.PolykeyAgent:Failed to stop task (NodeManager.refreshBucketHandler) after 10000ms
WARN:polykey.PolykeyAgent:Failed to stop task (NodeManager.refreshBucketHandler) after 10000ms
WARN:polykey.PolykeyAgent:Failed to stop task (NodeManager.refreshBucketHandler) after 10000ms
WARN:polykey.PolykeyAgent:Failed to stop task (NodeManager.refreshBucketHandler) after 10000ms
WARN:polykey.PolykeyAgent:Failed to stop task (NodeManager.refreshBucketHandler) after 10000ms
WARN:polykey.PolykeyAgent:Failed to stop task (NodeManager.refreshBucketHandler) after 10000ms
WARN:polykey.PolykeyAgent:Failed to stop task (NodeManager.refreshBucketHandler) after 10000ms
WARN:polykey.PolykeyAgent:Failed to stop task (NodeManager.refreshBucketHandler) after 10000ms
WARN:polykey.PolykeyAgent:Failed to stop task (NodeManager.refreshBucketHandler) after 10000ms
WARN:polykey.PolykeyAgent:Failed to stop task (NodeManager.refreshBucketHandler) after 10000ms

aryanjassal · 2025-01-16T02:10:54Z

I got something similar again, but this time, on the main node I operate. This time, it had different handlers which failed to stop on time.

WARN:polykey:Stopping Agent
WARN:polykey.PolykeyAgent:Failed to stop task (NodeManager.checkConnectionsHandler) after 10000ms
WARN:polykey.PolykeyAgent:Failed to stop task (NodeManager.refreshBucketHandler) after 10000ms
WARN:polykey.PolykeyAgent:Failed to stop task (NodeManager.refreshBucketHandler) after 10000ms
WARN:polykey.PolykeyAgent:Failed to stop task (NodeManager.refreshBucketHandler) after 10000ms
WARN:polykey.PolykeyAgent:Failed to stop task (NodeManager.refreshBucketHandler) after 10000ms

CMCDragonkai · 2025-01-20T21:03:23Z

Debugging these is going to require you to dig inwards and observe the concurrent code, rather than just sitting on the outside of the black box.

aryanjassal · 2025-02-04T01:23:32Z

After implementing the RPC cancellation, I now rarely observe the node getting stuck on streams. Of course, if other nodes aren't updated, then those streams will be held open, so it won't make a difference if our client is updated.

Brian, Brynley, and I have been using the latest version for a bit and have not observed the process hanging on shutdown. I have also seen that Brian has made changes to the source code, adding ctx to the refreshBucketHandler, so I am yet to see this issue after updating. Perhaps the open streams were partially related to tasks failing to stop?

Anyway, I haven't seen this issue happen recently. I'll keep an eye on the status of nodes for the seednodes and Brynley's and Soorya's nodes.

aryanjassal · 2025-02-04T01:36:38Z

I have asked Brynley and Brian to run the command polykey nodes connections to show me the connections they currently have, and they are all closing properly; no held stream was visible. I asked Brynley to clone a vault which was too big, causing an RPC timeout. This previously leaked a stream, but it shut down properly this time.

As I mentioned before, if even a single person is on the older version, then they can leak streams to other nodes which will hold them open. This might be a potential attack vector in the future, so we might need a way to force-close streams after a period of inactivity, regardless of whether the streams are being held open by a process.
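One possible shape for that force-close, sketched under assumptions: `IdleStreamReaper` is a hypothetical helper, stream ids are plain strings, and the caller is responsible for actually destroying the streams it gets back. The clock is passed in so the logic is easy to test.

```typescript
class IdleStreamReaper {
  private readonly idleLimitMs: number;
  private lastActivity = new Map<string, number>();

  constructor(idleLimitMs: number) {
    this.idleLimitMs = idleLimitMs;
  }

  // Record activity (data read or written) on a stream.
  touch(streamId: string, now: number): void {
    this.lastActivity.set(streamId, now);
  }

  // Stop tracking a stream that closed normally.
  close(streamId: string): void {
    this.lastActivity.delete(streamId);
  }

  // Return the ids of streams idle for longer than the limit and stop tracking them.
  reap(now: number): string[] {
    const expired: string[] = [];
    this.lastActivity.forEach((last, streamId) => {
      if (now - last > this.idleLimitMs) expired.push(streamId);
    });
    for (const streamId of expired) this.lastActivity.delete(streamId);
    return expired;
  }
}
```

Running `reap(Date.now())` on an interval and force-closing whatever it returns would bound how long a stream leaked by an older peer can be held open.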

Anyways, if no one encounters this issue on the latest version (["0.16.13","1.18.0","1","1"]), then this issue can be considered resolved.

aryanjassal · 2025-02-04T04:58:10Z

This issue hasn't been fully resolved yet; it has only been minimised. In rare cases, the agent can still get stuck in the shutdown state. Eventually the agent resolves the open streams, but they still remain open. When that happened, I got more warnings from the task manager that refreshBucketHandler wasn't stopped properly. This means that, even though cancellation has been added, it still doesn't work perfectly for that task and it needs investigation. That should be the next target of optimisation and fixing.

aryanjassal · 2025-02-06T03:47:52Z

Brian also ran a check on his active connections, which displayed 4 active streams. After he attempted to terminate the agent, there were four warnings that NodeManager.refreshBucketHandler did not shut down within 10 seconds. So, I can confidently say that a major reason for the agent being stuck was the background tasks not cancelling properly.

After investigation, I realised that it already had cancellation built-in. However, that cancellation wasn't effective, as apparent from the tasks failing to stop. So, I am adding more rigorous cancellation to basically every async operation.

To test this out, I will wait until a stream is locked, then attempt to stop the agent. If the agent shuts down gracefully, then this might be the solution for this issue.
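As an illustration of what more rigorous cancellation can look like, here is a generic sketch using the standard AbortSignal. It is not the actual refreshBucketHandler: the function name, the bucket ids, and the `refreshOne` callback are all placeholders.

```typescript
// Cooperative cancellation: check the signal before each unit of work and
// pass it down so the in-flight operation can abort its own I/O as well.
async function refreshBuckets(
  buckets: string[],
  signal: AbortSignal,
  refreshOne: (bucket: string, signal: AbortSignal) => Promise<void>,
): Promise<string[]> {
  const refreshed: string[] = [];
  for (const bucket of buckets) {
    // Stop promptly instead of working through the remaining buckets.
    if (signal.aborted) break;
    await refreshOne(bucket, signal);
    refreshed.push(bucket);
  }
  return refreshed;
}
```

The check between iterations only helps if each awaited operation also honours the signal. An await on a stream that never settles would still block the task, which is why the signal is handed to `refreshOne` as well.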

tegefaulkes · 2025-02-10T02:14:53Z

Just adding a note here. With the recent addition of network segregation to Polykey, I'm finding that I can pretty reliably trigger this problem with the agent not terminating properly.

When trying to use a connection that hasn't authenticated, we will reliably end up with an authentication error. This makes errors being thrown when using a connection much more likely now. I suspect these errors are related to the problem.

Regardless, with the new changes it should be very easy to catch examples of the problem happening. It should make debugging easier. @aryanj you should use the latest polykey while working on this.

aryanjassal · 2025-02-11T06:46:19Z

I have revisited all the handlers and updated them to handle cancellation more rigorously. After quick testing, I can see that the agent no longer hangs while shutting down. I believe this should have resolved the issue. I will keep an eye on it and if the agent still doesn't shut down properly as of version ["0.17.1","1.21.0","1","1"] or later, then I might re-open this issue.

CryptoTotalWar added the bug Something isn't working label May 22, 2024

CryptoTotalWar self-assigned this May 22, 2024

CryptoTotalWar changed the title ~~Polykey agent stop command not terminating properly~~ Polykey-CLI: agent stop command not terminating properly May 22, 2024

CryptoTotalWar assigned tegefaulkes and unassigned CryptoTotalWar Jun 16, 2024

tegefaulkes assigned amydevs and unassigned tegefaulkes Jun 20, 2024

tegefaulkes unassigned amydevs Jul 26, 2024

CMCDragonkai added the r&d:polykey:core activity 1 Secret Vault Sharing and Secret History Management label Aug 15, 2024

CMCDragonkai mentioned this issue Aug 17, 2024

Add a shutting down message to polykey agent start command #157

Closed

CMCDragonkai mentioned this issue Aug 23, 2024

Node crashes/exits ungracefully on uncaught exceptions/events relating to QUIC connections or QUIC streams #198

Closed

CMCDragonkai added r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices and removed r&d:polykey:core activity 1 Secret Vault Sharing and Secret History Management labels Aug 23, 2024

tegefaulkes mentioned this issue Aug 23, 2024

Server stream progress updates for RPC calls that have long running async operations #264

Open

CMCDragonkai mentioned this issue Nov 1, 2024

Failed vault clone or vault pull on N1 causes N2 to crash #324

Closed

aryanjassal changed the title ~~Polykey-CLI: agent stop command not terminating properly~~ polykey agent stop command not terminating properly Nov 21, 2024

aryanjassal mentioned this issue Dec 13, 2024

Add support for cancellation of potentially long-running operations MatrixAI/js-encryptedfs#86

Open

aryanjassal mentioned this issue Jan 16, 2025

Memory leak in agent when re-running failed tasks #333

Closed

aryanjassal self-assigned this Feb 6, 2025

aryanjassal mentioned this issue Feb 10, 2025

Adding cancellation to background handlers to prevent agent from being held open MatrixAI/Polykey#872

Merged

10 tasks

aryanjassal closed this as completed in MatrixAI/Polykey#872 Feb 11, 2025
