Project Call nodes stalling #74

knolleary · 2024-05-28T09:50:29Z

Current Behavior

Reported by a self-hosted customer - originally here: #68 (comment)

We have been experiencing a problem of return messages suddenly not being delivered for some time now. It has occurred across multiple versions of NodeRED, multiple versions of Project nodes, and multiple versions of NodeJS (on the Windows server). We are unable to reliably recreate the problem. It appears to surface after some time of usage or # of messages (?) of the Project Call node. But I'm hesitant to say this because I've experienced a stall after just 7 messages, while I've also seen it do 60+ without a problem.
A (manual) restart of the calling instance (where the Project Call node is) solves the problem. Obviously, this isn't desirable in a production environment.

With an update from the end of last week:

It does appear we are getting a time-out from the project call node when no return is received. The problem also appears to rear its head more when multiple people are working with the Dashboard front-end (triggering flows that use the project calls to start Powershell scripts on the Windows based NR instance), and in the case of yesterday he was constantly starting actions. This leads us to believe the number of messages sent over a project call node plays a role here.

Expected Behavior

No response

Steps To Reproduce

No response

Environment

FlowFuse version:
Node.js version:
npm version:
Platform/OS:
Browser:

Linked Customers

Customer name and/or link to HubSpot contact

Steve-Mcl · 2024-05-28T14:54:13Z

For context, see discussion originally posted here: #68 (comment)

We are using a Project Call node (1 at this time, we plan to have several) to send messages to a NodeRED instance that is running on a Windows server, running FlowFuse Agent. A flow on the Windows server instance is used to run Powershell scripts, which also have an output that is sent back to the calling flow over the Project Out return.

We have been experiencing a problem of return messages suddenly not being delivered for some time now

And the only remedy when that happens is to restart the layer from which the project calls originate.
Meanwhile, messages sent via a second project call (to the same instance where the other one stalls) keep working perfectly.

@SynoUser-NL

Would you be able to share demo flows from the projectlink-call and the subroutine (link-in~...~link-return) project?

Assuming it is not a huge amount, please include all nodes leading up to the projectlink-call and beyond AND all nodes between the link-in~...~link-return nodes. Please also include any debug nodes you have added that you use for verifying the call was sent/received/returned (though do be sure to sanitise or obfuscate anything sensitive).

Thanks, Steve.

knolleary · 2024-05-28T15:35:44Z

Adding another update provided by the customer:

We are now able to see which message was sent last (using a queue) to the Project Call node that times out.
I've also enabled a Dashboard on the Windows instance that shows us the last message sent to the Project Out\Return node.
The time-out flow now triggers a message to a Project Test call (to the same Windows instance) and adds the message return time.
When the main Project Call node stops responding, the test Project Call remains in working order.

It appears to us at this time that the Project Call node sometimes does not "catch" the return message sent by a Project Out return node, and so fails to resume the flow that is built.
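
To make the "not caught" symptom easier to reason about, here is a minimal sketch of how a call/return pair can be correlated by ID with a timeout. It is not the Project nodes' implementation: `CallTracker`, `handleReturn`, the `unmatched` list and the ID format are all hypothetical names chosen for illustration. The point it shows is that a return arriving when no matching call is pending (for example after a timeout or a lost subscription) has nowhere to go unless it is explicitly recorded.

```typescript
// Hypothetical sketch only: NOT the actual Project Call node code.
type Pending = {
  resolve: (payload: unknown) => void;
  reject: (err: Error) => void;
  timer: ReturnType<typeof setTimeout>;
};

class CallTracker {
  private pending = new Map<string, Pending>();
  private nextId = 0;
  // IDs of returns that arrived with no matching pending call.
  public unmatched: string[] = [];

  // Register an outgoing call. `send` stands in for whatever transport
  // delivers the request to the remote instance.
  call(
    payload: unknown,
    timeoutMs: number,
    send: (id: string, payload: unknown) => void
  ): Promise<unknown> {
    const id = `call-${this.nextId++}`;
    return new Promise((resolve, reject) => {
      const timer = setTimeout(() => {
        // Give up waiting: forget the call and surface a timeout error.
        this.pending.delete(id);
        reject(new Error(`timeout waiting for return of ${id}`));
      }, timeoutMs);
      this.pending.set(id, { resolve, reject, timer });
      send(id, payload);
    });
  }

  // Invoked when a return message arrives. Returns false when the return
  // could not be matched to a pending call.
  handleReturn(id: string, payload: unknown): boolean {
    const entry = this.pending.get(id);
    if (!entry) {
      this.unmatched.push(id); // record it instead of dropping it silently
      return false;
    }
    clearTimeout(entry.timer);
    this.pending.delete(id);
    entry.resolve(payload);
    return true;
  }

  pendingCount(): number {
    return this.pending.size;
  }
}
```

Logging the unmatched IDs on the calling side, next to the "last message sent" queue described above, would help distinguish a return that never arrived from one that arrived but could not be matched.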

robmarcer · 2024-07-03T13:00:24Z

Some feedback from a FlowFuse user - https://app-eu1.hubspot.com/contacts/26586079/record/0-1/8977201

Hi Support,

I am confident this is the same as an existing customer reported problem here: "Project Call nodes stalling #74" - https://github.com/orgs/FlowFuse/projects/1?pane=issue&itemId=64911130

I just wanted to register an interest in the resolution. Also perhaps I can add some extra information, though you decide.

I've traced this through from logs on the ff-agent device and logs on the ultimate endpoint.

Here's a snippet to illustrate, from the agent hosted node-red that makes the link call:

2024/07/03 10:44:34Z WARN FlowFuse: Disconnected
2024/07/03 10:44:35Z INFO FlowFuse: Connected
2024/07/03 10:45:30Z WARN flowfuse server not answering
2024/07/03 10:45:30Z INFO Stored telemetry for replay 2024-07-03T10:45:00.000Z
2024/07/03 10:49:30Z INFO Flowfuse server: We'll try again...
2024/07/03 10:49:30Z INFO Re-submitted telemetry for 2024-07-03T10:45:00.000Z

I left the link call with a rather generous 30 second time out, bold above to show that's when the link code returned to the flow with a timeout that I catch. The posting would have occurred at 10:45:00 (ish).

I can tell you that in this case ff-cloud delivered the original posting to the endpoint at 10:45:14 - this in itself is unusual, as normally that data passes through without meaningful delay (within the same second, on the endpoint I do not have access to finer grained times). Unfortunately I need to do some work on the server instance to better log when something occurs that is not to plan, so I cannot yet tell you if there was a long delay in the posting hitting the link-in node.

The 10:49:30 posting completed in normal time, no delays.

These are not isolated incidents, and they seem more common than a couple of months back (but I may not have been looking closely enough, so not sure).

I updated the project nodes packages to 0.7.0. NR is v3.1.10 on ff-cloud and, in the example above, v3.1.9 on the agent.

Please let me know if you find a resolution for this.

The double postings are not too problematic right now, but ultimately I need to stop them. So the link response disappearing means I cannot know what was or was not actioned.
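
One way to make double postings harmless while the underlying stall is investigated is to deduplicate on the receiving side. The following is a rough sketch under stated assumptions, not something taken from the Project nodes: it assumes each posting carries a stable identifier (the telemetry timestamp from the log above is used here as a hypothetical key), and `IdempotentReceiver`, `Posting` and `receive` are illustrative names.

```typescript
// Hypothetical receiver-side dedup: names and shapes are illustrative only.
interface Posting {
  id: string;   // assumed stable across retries, e.g. the telemetry timestamp
  body: string;
}

class IdempotentReceiver {
  private results = new Map<string, string>();
  public actioned = 0; // how many postings were actually processed

  // Process a posting at most once; a replay returns the stored result
  // without running the action again.
  receive(posting: Posting, action: (body: string) => string): string {
    const cached = this.results.get(posting.id);
    if (cached !== undefined) {
      return cached;
    }
    const result = action(posting.body);
    this.actioned++;
    this.results.set(posting.id, result);
    return result;
  }
}

// Usage: the same posting submitted twice is only actioned once.
const receiver = new IdempotentReceiver();
const posting: Posting = { id: "2024-07-03T10:45:00.000Z", body: "telemetry" };
const first = receiver.receive(posting, (b) => `stored:${b}`);
const replay = receiver.receive(posting, (b) => `stored:${b}`);
```

With this arrangement a re-submission after a lost link response is safe to send, because the endpoint can tell it has already actioned that posting. A real version would need to expire old entries rather than keep them in memory indefinitely.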

Steve-Mcl · 2024-07-09T16:58:15Z

For clarification, the issue mentioned above by @robmarcer is on FlowFuse Cloud, not self-hosted as in the OP.

In the instance above, I have checked the dates provided (both here and in other correspondence) and can see no correlation with CI/CD core app restarts.

I am awaiting a meet up with a FF Cloud customer who will be walking me through their process, and I will be able to get finer details (like instance/device IDs) for trawling our logs with finer focus than I have been able to during today's investigations.

knolleary added the needs-triage ("Needs looking at to decide what to do") label May 28, 2024

knolleary added this to ☁️ Product Planning May 28, 2024

knolleary added bug ("Something isn't working") and priority:high ("High Priority") and removed needs-triage ("Needs looking at to decide what to do") labels May 28, 2024

knolleary added this to 🛠 Development May 28, 2024

knolleary moved this to Up Next in 🛠 Development May 28, 2024

knolleary assigned Steve-Mcl May 28, 2024

Steve-Mcl mentioned this issue Sep 2, 2024

Introduce short, random reconnection delay #100

Closed
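
The linked issue #100 is titled "Introduce short, random reconnection delay". As a rough illustration of what such a delay could look like, here is a small sketch. The function name, the bounds and the injectable random source are all assumptions made for this example, and nothing here is taken from the change discussed in #100.

```typescript
// Hypothetical helper: returns a delay in milliseconds drawn uniformly
// from [minMs, maxMs). `rand` is injectable so the function can be tested.
function randomReconnectDelay(
  minMs: number,
  maxMs: number,
  rand: () => number = Math.random
): number {
  if (minMs < 0 || maxMs < minMs) {
    throw new RangeError("expected 0 <= minMs <= maxMs");
  }
  return minMs + rand() * (maxMs - minMs);
}

// Example bounds (assumed values): wait between 0.5 s and 2 s.
const delayMs = randomReconnectDelay(500, 2000);
```

A common reason to randomise reconnection delays is to stop several clients from reconnecting at the same instant after a shared disconnect. Whether that is the motivation behind #100 is not stated in this thread.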
