Causes 100% CPU consumption when Miniserver is at Config 15.3 Beta 3 #85

Open
hoschult opened this issue Dec 1, 2024 · 22 comments

@hoschult

hoschult commented Dec 1, 2024

The module seems to not work when running the current beta (Config 15.3 Beta 3). The problem looks like this:
(1) the node-red process consumes 100% CPU even if no flows are activated
(2) the structure file is received, however the selector in the nodes does not respect the correlation (a switch in room xyz)
To isolate the issue I gave the Miniserver config node an IP address which is not reachable and redeployed the flow - the 100% CPU utilization went away; switching back to the working IP address, the high CPU load comes back.
How to reproduce the problem:
a) take a brand new Node-RED instance
b) install node-red-contrib-loxone
c) connect to a Miniserver which is running Config 15.3 Beta 3
d) observe the looping node-red process with top etc. at the OS level (consumes about 100% CPU)

Loxone Config 15.3 Beta 2 was OK --> I'm not sure if the issue is on the Loxone side. However (and from my point of view), node-red-contrib-loxone should be robust enough not to go into nested loops (causing 100% CPU utilization) when receiving suspicious Loxone responses.

@codmpm
Owner

codmpm commented Dec 2, 2024

Hey Holger,

you're probably right. Could you get me the error message from node-red-log?

Cheers,
Patrik

@hoschult
Author

hoschult commented Dec 2, 2024

Hi Patrik,
there is no error message in node-red-log; while the symptoms occur (100% CPU load), I receive Miniserver connection timeouts resulting in new connection requests. As a user, I observe that actions such as turning on a light (a switch in the Loxone app triggers a flow which creates an API request towards a Shelly, for example) are delayed by 5-10 seconds --> which is logical because the node-red process is fully loaded.
Is there a trick to make node-red-contrib-loxone produce some debug output so that I can find out where the nested loops happen? I already tried some profilers... strace says the following:

root@raspimain:~# strace -p 26573 -c
strace: Process 26573 attached
^Cstrace: Process 26573 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 66.36    0.057999          11      5078           clock_gettime64
 10.41    0.009095          57       159           writev
  7.69    0.006719          39       170         2 futex
  4.41    0.003857          43        88           sendmsg
  2.73    0.002390          88        27         3 epoll_ctl
  2.48    0.002171          22        98           write
  1.76    0.001534          23        66           read
  1.61    0.001409         128        11           epoll_wait
  1.00    0.000874         124         7         7 connect
  0.64    0.000559          69         8           close
  0.23    0.000201           6        31           setsockopt
  0.18    0.000156           4        36           getpid
  0.18    0.000155          51         3           shutdown
  0.14    0.000126          18         7           socket
  0.14    0.000120          30         4         2 accept4
  0.04    0.000031           4         7           getsockopt
  0.00    0.000000           0         8           mprotect
  0.00    0.000000           0         1           getpeername
  0.00    0.000000           0         3           cacheflush
------ ----------- ----------- --------- --------- ----------------
100.00    0.087396          15      5812        14 total

@codmpm
Owner

codmpm commented Dec 2, 2024

Hmmm, that's strange, as typically there should be something pointing in the direction of where the problem occurs.

Will have to install beta 3 to check for myself.
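
In the meantime, raising Node-RED's own console log level in settings.js might surface more detail. A sketch of the relevant excerpt, assuming the default settings.js layout:

// excerpt from Node-RED's settings.js - raise the console log level
module.exports = {
    // ...existing settings...
    logging: {
        console: {
            level: "trace", // default is "info"; "trace" is the most verbose
            metrics: false,
            audit: false
        }
    }
};

After changing the level, restart Node-RED and watch node-red-log while the connection to the Miniserver is being established.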

@hoschult
Author

hoschult commented Dec 2, 2024

Yes, same view. I have opened a Loxone ticket; they told me that it's hard to downgrade the firmware (just for your attention). Furthermore, there will be a Beta 4 arriving today or tomorrow; I will keep you updated.

@codmpm
Owner

codmpm commented Dec 2, 2024

Oh, good to know.
Could you test with beta 4?

If the problem persists I should change node-red-contrib-loxone. But of course this makes no sense for betas ;-)

@hoschult
Author

hoschult commented Dec 2, 2024

Sure, will do. I searched around for what clock_gettime64 from strace means (because it is the top-scoring call); it is commonly used for high-resolution timing and performance measurements.
Do you have any timing-relevant code in the module?
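
For context, it does not even have to be explicit timing code: a permanently busy event loop alone produces this pattern, because libuv refreshes its cached time on every loop turn. A minimal standalone sketch (not module code) that keeps a node process at ~100% CPU and shows up as mostly clock_gettime under strace:

// busy-loop.js - standalone demo, not taken from node-red-contrib-loxone
// A recursive setImmediate keeps the event loop spinning without ever idling,
// so the process pegs one core and libuv updates its clock on every turn.
function spin() {
    setImmediate(spin);
}
spin();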

@codmpm
Owner

codmpm commented Dec 2, 2024

Do you have any timing-relevant code in the module?

Not that I'm aware of.

@hoschult
Author

hoschult commented Dec 2, 2024

Beta 4 was released and unfortunately has the same load issue (100% CPU allocation on the node-red process). I did some investigation with the node.js profiler and would like to share the following top ticks which were accounted:

(screenshot: node.js profiler output with the top ticks)

Steps to do the profiling:
(1) start Node-RED in profiling mode:
NODE_ENV=production node --prof /usr/lib/node_modules/node-red/red.js
(2) analyze the trace that was created:
node --prof-process isolate-0x5a1f408-28832-v8.log > processed.txt

@hoschult
Author

hoschult commented Dec 2, 2024

I had a look at my LoxAPP3.json which comes from the Miniserver; there are a lot of UUID resources defined. Maybe the profiler ticks shared above come from nested loops within node-lox-ws-api? The relevant code sections contain a couple of forEach constructs like this:

this.connection.on('message_event_table_values', function(messages) {
    messages.forEach(function(evt){
        that.emit('update_event_value', evt.uuid.string, evt.value);
        that.emit('update_event_value_' + evt.uuid.string, evt.value);
    });
    that.emit('message_event_table_values', messages);
});

Only an assumption....
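
To put a rough number on that assumption, here is a standalone micro-benchmark (hypothetical figures, not module code) showing how the per-UUID emits add up once the structure file defines thousands of UUIDs, each with its own listener:

// emit-scaling.js - standalone sketch with made-up numbers, not module code
const { EventEmitter } = require('events');

const that = new EventEmitter();
that.setMaxListeners(0); // allow many listeners without warnings

// Simulate a structure file with many UUIDs, one listener per UUID
const uuids = Array.from({ length: 5000 }, (_, i) => 'uuid-' + i);
for (const uuid of uuids) {
    that.on('update_event_value_' + uuid, function () { /* node callback */ });
}

// Simulate one incoming event table that touches every UUID
console.time('emit all');
for (const uuid of uuids) {
    that.emit('update_event_value', uuid, 42);
    that.emit('update_event_value_' + uuid, 42);
}
console.timeEnd('emit all');

If the Miniserver now sends these event tables much more often (or much larger ones) than before, the same forEach would simply run that many more emits per second.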

@codmpm
Owner

codmpm commented Dec 2, 2024

I have to look into this but my time is currently limited.
Feel free to open a pull request.
As we already forked node-lox-ws-api there should be no problem to incorporate a fix there.

@hoschult
Author

hoschult commented Dec 3, 2024

Some updates... node.js inspection (via the Chrome DevTools profiler visualizer) shows that the process is busy all the time running "processImmediate" (violet blocks in the flame chart screenshot):

(screenshot: Chrome DevTools flame chart, processImmediate blocks highlighted)

Drilling down into the processing tree of recurring tasks (_updateEvent) points to the above-mentioned code sections:

(screenshot: profiler call tree drilling down into _updateEvent)

which ends up at the following place in loxone.js:

(screenshot: call site in loxone.js)

this calls the expensive function in API.js:

(screenshot: the expensive function in API.js)

Feel free to open a pull request.

I'll try to support as best as I can, however I have doubts whether my coding skills are strong enough :-(

@codmpm
Owner

codmpm commented Dec 3, 2024

As 15.3 should be released today, could you check it again with the final release?
Of course I have to fix the bug then...

@hoschult
Author

hoschult commented Dec 3, 2024

Now running on 15.3 (no beta any more), however the behavior is the same :-( Some thoughts: the issue came to my attention between Beta 2 and Beta 3. In that step, they introduced the MQTT client; maybe they changed the structure or how they operate on events/messages?

My IDE suggested using the following code instead (in API.js, starting at line 127) to reduce emit calls and to stop the forEach traversal once a matching pair is found:

this.connection.on('message_event_table_values', function (messages) {
    const updateEvents = [];
    const specificUpdates = {};

    // Efficiently collect events
    for (const evt of messages) {
        updateEvents.push({ uuid: evt.uuid.string, value: evt.value });
        specificUpdates[`update_event_value_${evt.uuid.string}`] = evt.value;
    }

    // Batch emit operation
    that.emit('update_event_values', updateEvents);

    // Emit specific UUID updates
    for (const [eventKey, value] of Object.entries(specificUpdates)) {
        that.emit(eventKey, value);
    }

    // Emit the entire message
    that.emit('message_event_table_values', messages);
});
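
One caveat with this sketch: it emits a single aggregate 'update_event_values' event instead of one 'update_event_value' per entry, so any consumer still registered on the old per-entry name would no longer fire. A minimal standalone illustration (hypothetical listener, not module code):

// event-name-check.js - standalone illustration, not module code
const { EventEmitter } = require('events');
const that = new EventEmitter();

// Consumer still listening on the old per-entry event name
that.on('update_event_value', (uuid, value) => {
    console.log('old-style listener fired:', uuid, value);
});

// The batched variant above only emits the new aggregate event...
that.emit('update_event_values', [{ uuid: 'abc', value: 1 }]);
// ...so nothing is logged; the consumer would have to switch to:
that.on('update_event_values', (events) => {
    events.forEach((evt) => console.log('batched listener fired:', evt.uuid, evt.value));
});
that.emit('update_event_values', [{ uuid: 'abc', value: 1 }]);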

@hoschult
Author

hoschult commented Dec 3, 2024

See this pull request. The CPU load is gone (good :-))... However, the delay in events stays; 5-10 seconds after a Loxone switch is triggered, the message appears in Node-RED. Not an acceptable user experience :-(

@hoschult
Author

hoschult commented Dec 3, 2024

While testing, I'm seeing some events which are not caught anymore: e.g. the RGB light circuit in the lighting block (Lichtbaustein)

@codmpm
Owner

codmpm commented Dec 3, 2024

Thank you so much. Will have to look into this....

@bausi2k

bausi2k commented Dec 4, 2024

Thanks for the prework - so upgrading to the public 15.3 isn't recommended at the moment if you rely on Node-RED working with Loxone.

@hoschult
Author

hoschult commented Dec 4, 2024

Yes, at least in my humble opinion it's better to stay on the older branch for the moment (especially because downgrading the Miniserver is not that simple, according to Loxone)... Let's wait until Patrik can take a look into it.

@bausi2k

bausi2k commented Dec 13, 2024

Is it safe to update? Any news?

@codmpm
Owner

codmpm commented Dec 13, 2024

Did not have time to look into it yet. Sorry.

@bausi2k

bausi2k commented Dec 13, 2024

Just asking - thanks for the time you spend on the project.

@thmeger
Collaborator

thmeger commented Dec 19, 2024

I'm on 15.3.12.13 with NR 4.0.7. No problems so far. So, does the problem come with an even newer version of Loxone Config?
