Project Call nodes stalling #74

Open
knolleary opened this issue May 28, 2024 · 4 comments
Assignees
Labels
bug Something isn't working priority:high High Priority

Comments

@knolleary
Member

Current Behavior

Reported by a self-hosted customer - originally here: #68 (comment)

We have been experiencing a problem of return messages suddenly not being delivered for some time now. Multiple versions of NodeRED, multiple versions of Project nodes, multiple versions of NodeJS (on the Windows server). We are unable to reliably recreate the problem. It appears to surface after some time of usage or # of messages (?) of the Project Call node. But hesitant to say this because I've experienced a stall after just 7 messages, while I've also seen it do 60+ without a problem.
A (manual) restart of the calling instance (where the Project Call node is) solves the problem. Obviously, this isn't desirable in a production environment.

With an update from the end of last week:

It does appear we are getting a time-out from the project call node when no return is received. The problem also appears to rear its head more when multiple people are working with the Dashboard front-end (triggering flows that use the project calls to start Powershell scripts on the Windows based NR instance), and in the case of yesterday he was constantly starting actions. This leads us to believe the number of messages sent over a project call node plays a role here.

Expected Behavior

No response

Steps To Reproduce

No response

Environment

  • FlowFuse version:
  • Node.js version:
  • npm version:
  • Platform/OS:
  • Browser:

Linked Customers

  • Customer name and/or link to HubSpot contact
@knolleary knolleary added the needs-triage Needs looking at to decide what to do label May 28, 2024
@knolleary knolleary added bug Something isn't working priority:high High Priority and removed needs-triage Needs looking at to decide what to do labels May 28, 2024
@knolleary knolleary moved this to Up Next in 🛠 Development May 28, 2024
@Steve-Mcl
Contributor

For context, see discussion originally posted here: #68 (comment)

We are using a (1 at this time, we plan to have several) Project Call node to send msg's to a NodeRED instance that is running on a Windows server, running FlowFuse Agent. A flow on the Windows server instance is used to run Powershell scripts, which also have an output that is sent back to the calling flow over the Project Out return.

We have been experiencing a problem of return messages suddenly not being delivered for some time now

And the only remedy when that happens is to restart the layer from which the project calls originate.
While messages sent via a second project call (to the same instance as where the other one stalls) keeps working perfectly.

@SynoUser-NL

Would you be able to share demo flows from the projectlink-call and the subroutine (link-in~...~link-return) project?

Assuming it is not a huge amount, please include all nodes leading up to the projectlink-call and beyond AND all nodes between the link-in~...~link-return nodes. Please also include any debug nodes you have added to verify that the call was sent/received/returned (though do be sure to sanitise or obfuscate anything sensitive).

Thanks, Steve.

@knolleary
Member Author

Adding another update provided by the customer:

We are now able to see which message was sent last (using a queue) to the Project Call node that times out.
I've also enabled a Dashboard on the Windows instance that shows us the last message sent to the Project Out\Return node.
The time-out flow now triggers a message to a Project Test call (to the same Windows instance) and adds the message return time.
When the main Project Call node stops responding, the test Project Call remains in working order.

It appears to us at this time that the Project Call node sometimes does not "catch" the return message sent by a Project Out return node. Thus failing to resume the flow that is built.
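The tracking approach described above (record each message sent to the Project Call node, then see which one never gets a return) can be sketched roughly as follows. This is only an illustration, not the customer's flow or anything from the Project Nodes package: `PendingCalls`, its methods, the message IDs, and the 30-second timeout are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PendingCalls:
    """Hypothetical tracker for calls that were sent but have no return yet."""
    timeout_s: float = 30.0  # assumed value, not taken from the customer's setup
    _sent_at: Dict[str, float] = field(default_factory=dict)

    def sent(self, msg_id: str, now: float) -> None:
        # Record the time at which a message was handed to the call node.
        self._sent_at[msg_id] = now

    def returned(self, msg_id: str) -> bool:
        # True if the return matched an outstanding call, False otherwise.
        return self._sent_at.pop(msg_id, None) is not None

    def overdue(self, now: float) -> List[str]:
        # IDs of calls that have waited at least timeout_s without a return.
        return [m for m, t in self._sent_at.items() if now - t >= self.timeout_s]


calls = PendingCalls(timeout_s=30.0)
calls.sent("msg-1", now=0.0)
calls.sent("msg-2", now=5.0)
calls.returned("msg-1")          # msg-1 came back, msg-2 did not
print(calls.overdue(now=40.0))   # → ['msg-2']
```

Logging the overdue IDs on both sides (caller and Windows instance) would show whether a given return was sent but never matched on the calling side.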

@robmarcer

Some feedback from a FlowFuse user - https://app-eu1.hubspot.com/contacts/26586079/record/0-1/8977201

Hi Support,

I am confident this is the same as an existing customer reported problem here: "Project Call nodes stalling #74" - https://github.com/orgs/FlowFuse/projects/1?pane=issue&itemId=64911130

I just wanted to register an interest in the resolution. Also perhaps I can add some extra information, though you decide.

I've traced this through from logs on the ff-agent device and logs on the ultimate endpoint.

Here's a snippet to illustrate, from the agent hosted node-red that makes the link call:

2024/07/03 10:44:34Z WARN FlowFuse: Disconnected
2024/07/03 10:44:35Z INFO FlowFuse: Connected
2024/07/03 10:45:30Z WARN flowfuse server not answering
2024/07/03 10:45:30Z INFO Stored telemetry for replay 2024-07-03T10:45:00.000Z
2024/07/03 10:49:30Z INFO Flowfuse server: We'll try again...
2024/07/03 10:49:30Z INFO Re-submitted telemetry for 2024-07-03T10:45:00.000Z

I left the link call with a rather generous 30 second time out, bold above to show that's when the link code returned to the flow with a timeout that I catch. The posting would have occurred at 10:45:00 (ish).

I can tell you that in this case ff-cloud delivered the original posting to the endpoint at 10:45:14. This in itself is unusual, as normally that data passes through without meaningful delay (within the same second; on the endpoint I do not have access to finer-grained times). Unfortunately I need to do some work on the server instance to better log when something does not go to plan, so I cannot yet tell you whether there was a long delay in the posting hitting the link-in node.

The 10:49:30 posting completed in normal time, no delays.
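As a rough check of the timings quoted above, the arithmetic can be written out as below. The 10:45:00 posting time is approximate ("ish"), and the sketch assumes the timeout fired at 10:45:30, which matches the 30-second timeout and the WARN line in the log; the variable names are illustrative only.

```python
from datetime import datetime

FMT = "%Y/%m/%d %H:%M:%S"

posted = datetime.strptime("2024/07/03 10:45:00", FMT)     # approximate send time
delivered = datetime.strptime("2024/07/03 10:45:14", FMT)  # arrival at the endpoint
timed_out = datetime.strptime("2024/07/03 10:45:30", FMT)  # assumed timeout moment

# Time the posting took to reach the endpoint.
delivery_delay_s = (delivered - posted).total_seconds()
# Time left for the response to travel back before the call timed out.
remaining_s = (timed_out - delivered).total_seconds()

print(delivery_delay_s, remaining_s)  # → 14.0 16.0
```

On these numbers, roughly 14 seconds were spent on outbound delivery, leaving about 16 seconds for the endpoint to process the posting and for the response to come back.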

These are not isolated incidents, and they seem more common than a couple of months back (though I may not have been looking closely enough, so I am not sure).

I updated the project nodes packages to 0.7.0. NR is v3.1.10 on ff-cloud and, in the example above, v3.1.9 on the agent.

Please let me know if you find a resolution for this.

The double postings are not too problematic right now, but ultimately I need to stop them. So the link response disappearing means I cannot know what was or was not actioned.

@Steve-Mcl
Contributor

For clarification, the issue mentioned above by @robmarcer is on FlowFuse Cloud, not self-hosted (like the OP).

In the instance above, I have checked the dates provided (both here and in other correspondence) and can see no correlation with CI/CD core app restarts.

I am awaiting a meeting with an FF Cloud customer who will walk me through their process. That should give me finer details (such as instance/device IDs) so I can trawl our logs with a sharper focus than I was able to during today's investigation.
