Requesting too many channels can cause pva gateway to become unresponsive #139
Our main concern is that the GW becomes unresponsive, needing a restart; obviously there is an upper limit on the number of connections somewhere along the way, but we would hope that the GW is more graceful about this failure than it currently is.
I should finally note that this is tested on p4p 4.1.12 built against pvxs 1.3.1.
@simon-ess Can you check to see if any of your gateway processes have duplicate TCP connections open? eg. does
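As an illustration (not from the thread), one way to look for duplicate TCP connections is a short Python sketch using the third-party psutil package; psutil and the hard-coded GW_PID are assumptions here, not part of the gateway itself:

```python
# Hedged sketch: count TCP connections per remote endpoint for a given PID.
# psutil is a third-party dependency and GW_PID is a placeholder.
from collections import Counter

import psutil

GW_PID = 12345  # placeholder: the gateway process ID

conns = psutil.Process(GW_PID).connections(kind='tcp')
remotes = Counter(c.raddr for c in conns if c.raddr)  # skip unconnected sockets
for raddr, count in remotes.most_common():
    if count > 1:
        print(f"duplicate: {raddr.ip}:{raddr.port} x{count}")
```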
As I think about it, this number may be too low to be realistic. My usual local test gateway configuration with two
I happened to notice one FD leak due to my misunderstanding of the Python sqlite3 API.

https://github.com/mdavidsaver/p4p/blob/a03400e4f4e89202d45991eb37c0258b6e2497dc/src/p4p/gw.py#L174
https://docs.python.org/3/library/sqlite3.html#how-to-use-the-connection-context-manager

This should be fixed by 70b030d. However, I don't think this leak would explain FD exhaustion in a real gateway. In my test environment, it seems to be bounded by Python GC. I saw ~20 sqlite Connections hanging around after an hour.
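For context, the sqlite3 pitfall referenced above is that the connection object's context manager only wraps a transaction; it does not close the connection (or its file descriptor). A minimal sketch of the difference, with made-up file and table names:

```python
import sqlite3
from contextlib import closing

# The connection's context manager commits/rolls back a transaction,
# but the connection (and its FD) stays open after the with-block.
conn = sqlite3.connect("example.db")
with conn:
    conn.execute("CREATE TABLE IF NOT EXISTS t (v INTEGER)")
# conn is still open here; the FD is only released by an explicit close.
conn.close()

# contextlib.closing() actually closes the connection on exit.
with closing(sqlite3.connect("example.db")) as conn:
    with conn:  # inner context manager handles the transaction
        conn.execute("INSERT INTO t VALUES (1)")
```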
epics-base/pvxs@c2e5fdc avoids a possible FD leak, if an async
@simon-ess Have you been able to update, and/or have you observed any more instances of FD exhaustion?
@mdavidsaver We have not, but we also increased the resources available to our gateway when we discovered this issue. If you would like, I could re-run my test setup with a newer version of everything.
We have finally spent a bit of time looking at this again. What I can report is the following: with PVXS built on the latest commit (https://github.com/epics-base/pvxs/tree/5fa743d4c87377859953012af3c0fbcd1b063129) and p4p v4.1.12 built against that, we do see the worst of the situation seemingly avoided.

This is definitely an improvement, although in practice, if the number of PVs is quite large in comparison to the resources offered to the gateway, then the gateway becomes unresponsive for long enough that one might not realise it will recover on its own.
Do you have any idea what is going on while the gateway is unresponsive? I'm guessing 100% CPU usage? You might try running one of the Python profiling tools, as I think the bottleneck will be in the Python code handling search requests.
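Purely as an illustration, a sketch of wrapping the gateway's Python entry point in the stdlib profiler; p4p.gw.main (the function behind the pvagw console script) and its no-argument invocation are assumptions here, so adapt to however the gateway is actually launched:

```python
# Hedged sketch: profile the gateway's Python side with cProfile.
import cProfile
import pstats

from p4p.gw import main  # assumed: the entry point behind the pvagw script

prof = cProfile.Profile()
prof.enable()
try:
    main()  # assumed to run the gateway until interrupted (e.g. Ctrl-C)
except KeyboardInterrupt:
    pass
finally:
    prof.disable()
    pstats.Stats(prof).sort_stats("cumulative").print_stats(30)  # top 30 entries
```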
I doubt it is CPU usage, since we can cause the problem to come and go by modifying the allowed number of file descriptors.
@simon-ess I am confused. From #139 (comment) I understood that you were no longer observing FD exhaustion errors. Has this changed since April?

While you have an unresponsive gateway, can you dump the list of open FDs and see if any interesting patterns appear? (eg. multiple TCP connections with the same destination, lots of TCP sockets in close-wait states, ...) eg. run

Looking back to #139 (comment), can you report what
eg. With pvagw running with ~ the example config on a single host I observe:
I have tried a couple of variations on this, eg. short-lived vs. long-lived clients, monitor vs. get, and a couple of clients with the same or overlapping PV lists. So far I see the expected FD usage of
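For gathering the kind of FD dump asked about above, here is a minimal stdlib-only sketch (not from the thread); it is Linux-specific since it reads /proc, and GW_PID is a placeholder:

```python
# Hedged sketch: summarize a process's open file descriptors via /proc (Linux).
import os
from collections import Counter

GW_PID = 12345  # placeholder: the gateway process ID

fd_dir = f"/proc/{GW_PID}/fd"
targets = Counter()
for fd in os.listdir(fd_dir):
    try:
        target = os.readlink(os.path.join(fd_dir, fd))
    except OSError:
        continue  # FD was closed between listdir() and readlink()
    # Collapse e.g. "socket:[12345]" to "socket" so counts group by kind.
    targets[target.split(":", 1)[0]] += 1

print(f"total FDs: {sum(targets.values())}")
for kind, count in targets.most_common():
    print(f"{kind}: {count}")
```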
As I wrote this, I started thinking. Currently, there is no absolute bound on the number of channels in the client channel cache, including on currently unused channels. Having this fill up with unused channels might explain the "... for a time" part, eg. if that interval was 60 -> 90 seconds.

Entries in the channel cache should only have a TCP connection if some down-stream server gave a positive search response, and that server is still up. So simply having up-stream clients searching for non-existent PV names does not cause extra FD usage. While I can't quite imagine a specific scenario, I suppose it would be possible for an IOC in a restart loop to continue to keep an FD in use until timeout or GC sweep. Although, in the case of soft IOCs, I would expect such connections to be RST before the GC happens.

And of course there is always the "problem" of short-lived down-stream clients. Launching 10,000 instances of

(adding a gateway-configurable limit on the number of upstream clients in total and/or per host might be an interesting project)

In short I need some more data to generate ideas of where to look next. (metrics, error messages, ...)
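For illustration only, a minimal sketch of what a burst of short-lived clients through a gateway might look like using p4p's thread client; the PV name and loop count are placeholders, and this is a stress-test sketch rather than anything from the thread:

```python
# Hedged sketch: many short-lived client contexts, each opening and closing
# its own sockets, as a crude stand-in for short-lived downstream clients.
from p4p.client.thread import Context

for i in range(100):            # each iteration acts as one short-lived client
    ctxt = Context('pva')       # fresh client context -> new UDP/TCP sockets
    try:
        ctxt.get('TST:COUNTER', timeout=2.0)  # placeholder PV name
    except TimeoutError:
        pass                    # an unresponsive gateway will time out here
    finally:
        ctxt.close()            # release the context's sockets promptly
```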
Sorry if I wasn't clear: we resolved our observed issue by increasing the number of file descriptors, I believe to 2048. What we are hoping to improve is the behaviour if that limit is breached. The reason this is relevant is that a user had written a script to fetch a lot of PVs through a gateway, causing it to completely crash. What we would like is better behaviour if someone, intentionally or otherwise, performs a DoS attack on the gateway.

In my test setup I have a gateway with a max of 27 in LimitNOFILE; this allows for two PVs to be fetched through the gateway before all falls down. If I make a large number of requests, then the (relevant) part of the output of
From what I can see, it is the last

Attached are two sets of logs. The first one is from a request for 25 PVs through the gateway; it has been somewhat cleaned up, as it is extremely verbose. As near as I can tell, all of the stack traces are the same. The second one is from a successful request of a single PV through the gateway.
If you request too many channels from the gateway, then it can become fully unresponsive despite still technically running; the only solution seems to be to restart the gateway. The number of channels at which this happens is related to the maximum number of file descriptors that the GW is allowed to use.
To reproduce the issue, set

LimitNOFILE=27

in the service file for the gateway.

More specifically, what I have observed is the following:
Example logs from requesting 3 PVs:
For a request of 25 PVs, it looks much the same but with a lot more blocks of