
Watchdog timeout on large point count N4 Supervisors w/ read op #14

Open
tblong opened this issue Jun 3, 2022 · 7 comments
Comments

@tblong

tblong commented Jun 3, 2022

[screenshot]

Related to nHaystack v3.0.1+. When performing a read operation such as read?filter=point+and+cur against a large point count N4 Supervisor (as above), we have seen the watchdog timeout get triggered and the station restart. The watchdog event occurs even when the optional limit parameter is added with a low value.
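For reference, a minimal sketch of the kind of read being described, using the Project Haystack Java toolkit (HClient). The host, credentials, and use of HClient here are illustrative assumptions, not necessarily how the failing requests were actually issued:

```java
import org.projecthaystack.HGrid;
import org.projecthaystack.client.HClient;

public class ReadExample {
    public static void main(String[] args) {
        // Placeholder host and credentials -- substitute your own Supervisor details.
        HClient client = HClient.open("https://localhost/haystack/", "user", "pass");

        // Equivalent of GET /haystack/read?filter=point+and+cur
        HGrid all = client.readAll("point and cur");
        System.out.println("rows: " + all.numRows());

        // Same filter, capped with the optional limit parameter
        HGrid limited = client.readAll("point and cur", 10);
        System.out.println("rows (limit 10): " + limited.numRows());
    }
}
```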

Questions:

  1. Is a filter-based read operation currently executed on the main engine thread within Niagara?
  2. Are there any optimizations that could improve the read op here, or ensure the op performs its work off the main engine thread?
@ci-richard-mcelhinney
Owner

Hi @tblong,

That is a lot of points! I haven't tested this before with a station of that size, and no one else has reported using nHaystack in an application with that many points.

I'll need to have a look at the code and see what we can do.

@ci-richard-mcelhinney
Owner

Hi @tblong,

I have done an initial investigation into this situation, and making a change to the threading arrangement for the servlet isn't as simple as I first thought. I am going to try to set up a test station with the number of components you have, make some changes, and see what happens.

This is quite a significant change, so I want to proceed carefully.

@ci-richard-mcelhinney
Owner

@tblong I have tried a couple of different setups today. The first setup I built had 250,000 points. I didn't get a watchdog timeout; instead I ran into out-of-memory issues.

For the second setup I lowered the station to 150,000 points. The 'read' query with your filter worked over the REST API; however, it is holding on to a lot of memory.

I'm doing all this on a Windows virtual machine on my Mac. It has 16GB RAM and 4 cores allocated, and the default memory settings for the station JVM.

Can you provide more details on your Supervisor configuration? I think there is a problem, but it's more around memory management at this scale. I'm not seeing watchdog timeouts and station restarts as you describe. I am using the latest code, though I don't think that should make much of a difference.

@ci-richard-mcelhinney
Owner

@tblong also, I just tested the use of the limit parameter with the following query: https://localhost/haystack/read?filter=point%20and%20cur&limit=10 and that worked as well; it returned very quickly.

@tblong
Author

tblong commented Jun 29, 2022

@ci-richard-mcelhinney Many thanks for the help digging in here. So it seems this might just be a max-heap setting issue? Is there a possibility for memory improvements in how nHaystack crawls the station during a read op as it gathers the response data?
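To illustrate the kind of memory improvement I mean (this is a generic sketch, not nHaystack's actual implementation, and all names in it are hypothetical): stream each matching row out to the response as the crawl finds it, rather than accumulating the whole result set in memory before encoding it.

```java
import java.io.IOException;
import java.io.Writer;
import java.util.Iterator;

// Hypothetical sketch: write each matching, already-encoded row incrementally
// instead of building one large in-memory grid for the whole station.
public class StreamingReadSketch {

    // 'points' stands in for the station crawl producing filtered, encoded rows;
    // it is a placeholder, not an nHaystack API.
    static void writeMatches(Iterator<String> points, int limit, Writer out) throws IOException {
        int written = 0;
        while (points.hasNext() && written < limit) {
            out.write(points.next()); // one crawled, already-filtered point
            out.write('\n');
            written++;
        }
        out.flush(); // nothing large retained on the heap after the response is sent
    }
}
```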

I will be on holiday from 6/30 to 7/7, but will work on gathering all the station metrics and config settings I can on my return.

@ci-richard-mcelhinney
Owner

@tblong I've also determined that the REST API requests are not serviced on the Engine Thread in the latest code. I'm not sure about the version you are using, but if you can upgrade you should get similar results to mine and hopefully you won't see watchdog timeouts.
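For anyone following along, the threading concern in general terms (a generic sketch, not Niagara's or nHaystack's actual code): long-running work such as a large filter read is handed to a worker pool, so the watchdog-monitored thread only dispatches the request and never blocks for the full duration of the query.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class OffloadSketch {
    private static final ExecutorService workers = Executors.newFixedThreadPool(4);

    // Returns immediately; the expensive crawl runs on a worker thread,
    // keeping the calling (watchdog-monitored) thread responsive.
    public static Future<String> handleRead(String filter) {
        return workers.submit(() -> runLargeFilterQuery(filter));
    }

    private static String runLargeFilterQuery(String filter) {
        // Placeholder for the expensive station crawl.
        return "result for: " + filter;
    }
}
```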

@tblong
Author

tblong commented Jul 8, 2022

@ci-richard-mcelhinney I gathered the station metrics below today. We only had browser access for this session, so we were not able to determine the actual max-heap setting, but we were still able to capture the station's memory metrics.

The nHaystack version is v3.2.0:
[screenshot]

The spy:sysInfo page with certain properties redacted:
[screenshot]

The spy:util/gc page after forcing a garbage collect:
[screenshot]

Let me know if there are any other metrics I can grab that would help further.
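One simple cross-check, if program or console access ever becomes available: the JVM itself reports the effective max heap regardless of where -Xmx is configured. A minimal, generic Java snippet (not a Niagara-specific API):

```java
public class HeapCheck {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024L * 1024L;
        // maxMemory() reflects the effective -Xmx for the running JVM.
        System.out.println("max heap:   " + rt.maxMemory() / mb + " MB");
        System.out.println("total heap: " + rt.totalMemory() / mb + " MB");
        System.out.println("free heap:  " + rt.freeMemory() / mb + " MB");
    }
}
```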
