Requesting too many channels can cause pva gateway to become unresponsive #139

simon-ess opened this issue Jan 24, 2024 · 15 comments

@simon-ess

If you request too many channels through the gateway, it can become completely unresponsive while still technically running; the only remedy seems to be to restart the gateway. The number of channels it can handle is tied to the maximum number of file descriptors that the GW is allowed to use.

To reproduce the issue:

  • Set up three hosts: a GW, an IOC host (mine has 25 IOCs, all with a common PV), and a host performing the requests
  • On the GW, reduce the ulimit as much as possible. On our setup, I could get away with setting LimitNOFILE=27 in the gateway's service file.
  • Request a bunch of PVs through the gateway (see the sketch below).
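For reference, the bulk request in the last step can be generated with the p4p client API. A minimal sketch, assuming the test IOCs serve PVs named like the I-<n>:ModuleVersions seen in the logs below (adjust the names and count to your setup):

# Minimal sketch of a bulk request routed through the gateway.
# PV names are placeholders matching the test IOCs described above.
from p4p.client.thread import Context

ctxt = Context('pva')  # EPICS_PVA_ADDR_LIST should point at the gateway
names = ['I-%d:ModuleVersions' % i for i in range(1, 26)]

# get() with a list of names issues the searches concurrently, which is
# enough to push a tightly limited gateway over its FD budget.
results = ctxt.get(names, timeout=5.0, throw=False)
for name, result in zip(names, results):
    print(name, result)
ctxt.close()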

More specifically, what I have observed is the following:

  • If you request a number of PVs small enough for the GW to handle, all is well. In my case, that means requesting a single PV.
  • If you request slightly more than that, the GW only partially responds. In my setup, this happened already at 2 PVs. However, it recovers on its own a few minutes later and then responds correctly to single-PV requests.
  • If you request significantly more (in my case, 25 PVs does the trick), the GW simply enters an unresponsive state. I have not seen it recover from this at all; it requires a restart.

Example logs from requesting 3 PVs:

INFO:p4p.gw:    EPICS_PVAS_SERVER_PORT : 5075
INFO:p4p.gw:    EPICS_PVA_ADDR_LIST : 1.1.1.1:5075
INFO:p4p.gw:    EPICS_PVA_AUTO_ADDR_LIST : NO
INFO:p4p.gw:    EPICS_PVA_BROADCAST_PORT : 5076
INFO:p4p.gw:    EPICS_PVA_CONN_TMO : 30
INFO:p4p.gw:    EPICS_PVA_INTF_ADDR_LIST : 1.1.1.1:5075
INFO:p4p.gw:    EPICS_PVA_NAME_SERVERS :
INFO:p4p.gw:    EPICS_PVA_SERVER_PORT : 5075
INFO:p4p.gw:Setup GW clients to ignore GW servers
2024-01-24T08:14:56.948553373 DEBUG p4p.gw 0x55fd7876a140 sweeps
2024-01-24T08:15:21.537241375 DEBUG p4p.gw 0x55fd7876a140 'I-1:ModuleVersions' channel cache miss
2024-01-24T08:15:21.537322500 DEBUG p4p.gw 0x55fd7876a140 new upstream channel 'I-1:ModuleVersions'
2024-01-24T08:15:21.537325103 DEBUG p4p.gw 0x55fd7876a140 test 'I-1:ModuleVersions' -> _
2024-01-24T08:15:21.537330196 DEBUG p4p.gw 0x55fd7876a140 testChannel '1.1.1.3:33043':'I-1:ModuleVersions' -> 0
2024-01-24T08:15:21.537338819 DEBUG p4p.gw 0x55fd7876a140 'I-2:ModuleVersions' channel cache miss
2024-01-24T08:15:21.537341782 DEBUG p4p.gw 0x55fd7876a140 new upstream channel 'I-2:ModuleVersions'
2024-01-24T08:15:21.537342971 DEBUG p4p.gw 0x55fd7876a140 test 'I-2:ModuleVersions' -> _
2024-01-24T08:15:21.537345199 DEBUG p4p.gw 0x55fd7876a140 testChannel '1.1.1.3:33043':'I-2:ModuleVersions' -> 0
2024-01-24T08:15:21.537350442 DEBUG p4p.gw 0x55fd7876a140 'I-3:ModuleVersions' channel cache miss
2024-01-24T08:15:21.537352995 DEBUG p4p.gw 0x55fd7876a140 new upstream channel 'I-3:ModuleVersions'
2024-01-24T08:15:21.537354164 DEBUG p4p.gw 0x55fd7876a140 test 'I-3:ModuleVersions' -> _
errlog: lost 1 messages
2024-01-24T08:15:21.537403457 DEBUG p4p.gw 0x55fd7876a140 upstream disconnect 'I-1:ModuleVersions'
2024-01-24T08:15:21.537410289 DEBUG p4p.gw 0x55fd7876a140 upstream disconnect 'I-2:ModuleVersions'
2024-01-24T08:15:21.537412278 DEBUG p4p.gw 0x55fd7876a140 upstream disconnect 'I-3:ModuleVersions'
2024-01-24T08:15:21.547952363 DEBUG p4p.gw 0x55fd7876a140 'I-1:ModuleVersions' channel cache hit
2024-01-24T08:15:21.547976840 DEBUG p4p.gw 0x55fd7876a140 test 'I-1:ModuleVersions' -> _
2024-01-24T08:15:21.547996749 DEBUG p4p.gw 0x55fd7876a140 testChannel '1.1.1.1:34801':'I-1:ModuleVersions' -> 0
2024-01-24T08:15:21.548030131 DEBUG p4p.gw 0x55fd7876a140 'I-2:ModuleVersions' channel cache hit
2024-01-24T08:15:21.548035587 DEBUG p4p.gw 0x55fd7876a140 test 'I-2:ModuleVersions' -> _
2024-01-24T08:15:21.548045568 DEBUG p4p.gw 0x55fd7876a140 testChannel '1.1.1.1:34801':'I-2:ModuleVersions' -> 0
2024-01-24T08:15:21.548071791 DEBUG p4p.gw 0x55fd7876a140 'I-3:ModuleVersions' channel cache hit
2024-01-24T08:15:21.548085894 DEBUG p4p.gw 0x55fd7876a140 test 'I-3:ModuleVersions' -> _
2024-01-24T08:15:21.548092670 DEBUG p4p.gw 0x55fd7876a140 testChannel '1.1.1.1:34801':'I-3:ModuleVersions' -> 0
2024-01-24T08:15:21.553156558 CRIT pvxs.client.io Unhandled error in search Rx callback: Unable to begin connecting
Dumping a stack trace of thread 'PVXCTCP':
[    0x7f11c2e32993]: /opt/conda/envs/pva-gateway/lib/python3.7/site-packages/p4p/../../../../epics/lib/linux-x86_64/libCom.so.3.22.0(epicsStackTrace+0x53)
[    0x7f11c2cb75b1]: /opt/conda/envs/pva-gateway/lib/python3.7/site-packages/p4p/../../../libpvxs.so.1.3(_ZN4pvxs6detail11_log_printfEjPKcz+0xa1)
elfRead() -- unable to open file: Too many open files
[    0x7f11c2d7ac34]: /opt/conda/envs/pva-gateway/lib/python3.7/site-packages/p4p/../../../libpvxs.so.1.3(<no symbol information>)
elfRead() -- unable to open file: Too many open files
[    0x7f11c2a43c54]: /opt/conda/envs/pva-gateway/lib/python3.7/site-packages/p4p/../../.././libevent_core-2.1.so.6(<no symbol information>)
[    0x7f11c2a4435f]: /opt/conda/envs/pva-gateway/lib/python3.7/site-packages/p4p/../../.././libevent_core-2.1.so.6(event_base_loop+0x3cf)
[    0x7f11c2d18064]: /opt/conda/envs/pva-gateway/lib/python3.7/site-packages/p4p/../../../libpvxs.so.1.3(<no symbol information>)
[    0x7f11c2e275e9]: /opt/conda/envs/pva-gateway/lib/python3.7/site-packages/p4p/../../../../epics/lib/linux-x86_64/libCom.so.3.22.0(epicsThreadCallEntryPoint+0x39)
elfRead() -- unable to open file: Too many open files
[    0x7f11c2e2c37e]: /opt/conda/envs/pva-gateway/lib/python3.7/site-packages/p4p/../../../../epics/lib/linux-x86_64/libCom.so.3.22.0(<no symbol information>)
elfRead() -- unable to open file: Too many open files
[    0x7f11c3fffea5]: /lib64/libpthread.so.0(<no symbol information>)
[    0x7f11c34178dd]: /lib64/libc.so.6(clone+0x6d)
2024-01-24T08:15:21.562152682 DEBUG p4p.gw 0x55fd7876a140 upstream connect 'I-2:ModuleVersions'
2024-01-24T08:15:21.562256243 DEBUG p4p.gw 0x55fd7876a140 upstream connect 'I-1:ModuleVersions'
2024-01-24T08:15:22.024572760 DEBUG p4p.gw 0x55fd7876a140 'I-3:ModuleVersions' channel cache hit
2024-01-24T08:15:22.024611594 DEBUG p4p.gw 0x55fd7876a140 test 'I-3:ModuleVersions' -> _
2024-01-24T08:15:22.024642473 DEBUG p4p.gw 0x55fd7876a140 testChannel '1.1.1.1:34801':'I-3:ModuleVersions' -> 0
2024-01-24T08:15:22.026337031 CRIT pvxs.client.io Unhandled error in search Rx callback: Unable to begin connecting
Dumping a stack trace of thread 'PVXCTCP':
[    0x7f11c2e32993]: /opt/conda/envs/pva-gateway/lib/python3.7/site-packages/p4p/../../../../epics/lib/linux-x86_64/libCom.so.3.22.0(epicsStackTrace+0x53)
[    0x7f11c2cb75b1]: /opt/conda/envs/pva-gateway/lib/python3.7/site-packages/p4p/../../../libpvxs.so.1.3(_ZN4pvxs6detail11_log_printfEjPKcz+0xa1)
[    0x7f11c2d7ac34]: /opt/conda/envs/pva-gateway/lib/python3.7/site-packages/p4p/../../../libpvxs.so.1.3(<no symbol information>)
[    0x7f11c2a43c54]: /opt/conda/envs/pva-gateway/lib/python3.7/site-packages/p4p/../../.././libevent_core-2.1.so.6(<no symbol information>)
[    0x7f11c2a4435f]: /opt/conda/envs/pva-gateway/lib/python3.7/site-packages/p4p/../../.././libevent_core-2.1.so.6(event_base_loop+0x3cf)
[    0x7f11c2d18064]: /opt/conda/envs/pva-gateway/lib/python3.7/site-packages/p4p/../../../libpvxs.so.1.3(<no symbol information>)
[    0x7f11c2e275e9]: /opt/conda/envs/pva-gateway/lib/python3.7/site-packages/p4p/../../../../epics/lib/linux-x86_64/libCom.so.3.22.0(epicsThreadCallEntryPoint+0x39)
[    0x7f11c2e2c37e]: /opt/conda/envs/pva-gateway/lib/python3.7/site-packages/p4p/../../../../epics/lib/linux-x86_64/libCom.so.3.22.0(<no symbol information>)
[    0x7f11c3fffea5]: /lib64/libpthread.so.0(<no symbol information>)
[    0x7f11c34178dd]: /lib64/libc.so.6(clone+0x6d)
2024-01-24T08:15:22.518578173 DEBUG p4p.gw 0x55fd7876a140 'I-1:ModuleVersions' channel cache hit
2024-01-24T08:15:22.518602016 DEBUG p4p.gw 0x55fd7876a140 test 'I-1:ModuleVersions' -> !
2024-01-24T08:15:22.518610584 DEBUG p4p.gw 0x55fd7876a140 testChannel '1.1.1.3:33043':'I-1:ModuleVersions' -> 1
2024-01-24T08:15:22.518627040 DEBUG p4p.gw 0x55fd7876a140 'I-2:ModuleVersions' channel cache hit
2024-01-24T08:15:22.518631494 DEBUG p4p.gw 0x55fd7876a140 test 'I-2:ModuleVersions' -> !
2024-01-24T08:15:22.518635554 DEBUG p4p.gw 0x55fd7876a140 testChannel '1.1.1.3:33043':'I-2:ModuleVersions' -> 1
2024-01-24T08:15:22.518644302 DEBUG p4p.gw 0x55fd7876a140 'I-3:ModuleVersions' channel cache hit
2024-01-24T08:15:22.518646187 DEBUG p4p.gw 0x55fd7876a140 test 'I-3:ModuleVersions' -> _
2024-01-24T08:15:22.518648576 DEBUG p4p.gw 0x55fd7876a140 testChannel '1.1.1.3:33043':'I-3:ModuleVersions' -> 0
libevent Warn: Error from accept() call: Too many open files
libevent Warn: Error from accept() call: Too many open files
libevent Warn: Error from accept() call: Too many open files
libevent Warn: Error from accept() call: Too many open files
libevent Warn: Error from accept() call: Too many open files
libevent Warn: Error from accept() call: Too many open files

For a request of 25 PVs, it looks much the same but with a lot more blocks of

2024-01-24T08:16:57.399401493 CRIT pvxs.client.io Unhandled error in search Rx callback: Unable to begin connecting
@simon-ess
Author

Our main concern is that the GW becomes unresponsive and needs a restart; obviously there is an upper limit on the number of connections somewhere along the way, but we would hope that the GW handles this failure more gracefully than it currently does.

@simon-ess
Author

Finally, I should note that this was tested on p4p 4.1.12 built against pvxs 1.3.1.

@mdavidsaver
Member

@simon-ess Can you check to see if any of your gateway processes have duplicate TCP connections open?

eg. does netstat -tpn | grep <PID> show duplicates in the Foreign Address column?
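If it is more convenient, the same tally can be done programmatically. A sketch using psutil (not part of p4p; pass the gateway PID, e.g. from pgrep pvagw):

# Count TCP connections per remote endpoint for a given process.
import collections
import sys

import psutil

pid = int(sys.argv[1])
counts = collections.Counter()
for conn in psutil.Process(pid).connections(kind='tcp'):
    if conn.raddr:  # skip listening sockets, which have no remote address
        counts['%s:%d' % (conn.raddr.ip, conn.raddr.port)] += 1

for addr, n in counts.most_common():
    print(addr, n, '<-- duplicate' if n > 1 else '')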

@mdavidsaver
Member

  • LimitNOFILE=27

As I think about it, this number may be too low to be realistic. My usual local test gateway configuration, with two "servers":[ entries and one "clients":[ entry, has 35 FDs open before the first PVA client connects.

$ ls -l /proc/`pgrep pvagw`/fd|wc -l
35

@mdavidsaver
Member

I happened to notice one FD leak due to my misunderstanding of the Python sqlite3 API: sqlite3.Connection.__exit__() does not close().

https://github.com/mdavidsaver/p4p/blob/a03400e4f4e89202d45991eb37c0258b6e2497dc/src/p4p/gw.py#L174

https://docs.python.org/3/library/sqlite3.html#how-to-use-the-connection-context-manager
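In other words, the connection's with-block only scopes a transaction; the connection stays open until close() is called or the object is garbage collected. A minimal illustration:

import contextlib
import sqlite3

conn = sqlite3.connect(':memory:')
with conn:
    conn.execute('CREATE TABLE t(x)')  # committed on exiting the block
# conn is still open here; __exit__() did not close it.
conn.close()

# To also close deterministically, wrap with contextlib.closing():
with contextlib.closing(sqlite3.connect(':memory:')) as conn:
    with conn:
        conn.execute('CREATE TABLE t(x)')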

This should be fixed by 70b030d.

However, I don't think this leak would explain FD exhaustion in a real gateway. In my test environment, it seems to be bounded by Python GC; I saw ~20 sqlite Connections hanging around after an hour.

@mdavidsaver
Member

epics-base/pvxs@c2e5fdc avoids a possible FD leak when an async connect() fails, which is logged as Unable to begin connecting.

@mdavidsaver
Member

@simon-ess Have you been able to update, and/or have you observed any more instances of FD exhaustion?

@simon-ess
Author

@mdavidsaver We have not, but we also increased the resources available to our gateway when we discovered this issue.

If you would like, I could re-run my test setup with a newer version of everything.

@simon-ess
Author

We have finally spent a bit of time looking at this again. What I can report is the following:

With PVXS built from the latest commit (https://github.com/epics-base/pvxs/tree/5fa743d4c87377859953012af3c0fbcd1b063129) and p4p 4.1.12 built against that, the worst-case behaviour does seem to be avoided:

  • If one requests a small number of PVs, there is no issue
  • If one requests a large number, the gateway generally becomes unresponsive for a time
  • However, it does not completely crash and does self-recover

This is definitely an improvement, although in practice, if the number of PVs is large compared to the resources given to the gateway, it stays unresponsive for long enough that one might not realise it will recover on its own.

@mdavidsaver
Member

  • If one requests a large number, the gateway generally becomes unresponsive for a time

Do you have any idea what is going on while this is happening? I'm guessing 100% CPU usage?

You might try running one of the Python profiling tools, as I think the bottleneck will be in the Python code handling search requests.
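For example, the gateway could be started under cProfile. A sketch, assuming the pvagw console script maps to p4p.gw:main (the config file name here is hypothetical; adjust if your entry point differs):

# Run the gateway under cProfile and print cumulative times on Ctrl-C.
import cProfile
import pstats
import sys

from p4p.gw import main  # assumption: pvagw entry point is p4p.gw:main

sys.argv = ['pvagw', 'gateway-config.json']  # hypothetical config file
prof = cProfile.Profile()
try:
    prof.runcall(main)
except KeyboardInterrupt:
    pass
finally:
    pstats.Stats(prof).sort_stats('cumulative').print_stats(30)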

@simon-ess
Author

I doubt it is CPU usage, since we can cause the problem to come and go by modifying the allowed number of file descriptors.

@mdavidsaver
Member

... by modifying the allowed number of file descriptors.

@simon-ess I am confused. From #139 (comment) I understood that you were no longer observing FD exhaustion errors. Has this changed since April?

While you have an unresponsive gateway, can you dump the list of open FDs and see if any interesting patterns appear? (eg. multiple TCP connections with the same destination. lots of TCP sockets in close-wait states. ...)

eg. run lsof -p <pid> -a -d fd for a more informative version of ls -l /proc/<pid>/fd.

Looking back to #139 (comment), can you report what LimitNOFILE your gateway(s) are running with, and how this number was chosen?
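The limit actually in effect for a running gateway can be read back from /proc/<pid>/limits, or from Python. A small helper (Linux only, Python >= 3.4; PID e.g. from pgrep pvagw):

# Read the file-descriptor limit of a running process.
import resource
import sys

pid = int(sys.argv[1])
soft, hard = resource.prlimit(pid, resource.RLIMIT_NOFILE)
print('RLIMIT_NOFILE: soft=%d hard=%d' % (soft, hard))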

@mdavidsaver
Member

eg. With pvagw running with ~ the example config on a single host, I observe:

  1. a baseline FD usage of 31.
  2. If I launch 100 IOCs behind this gateway, and connect with 1 client, FD usage rises to 132 (31 + 100 + 1)
  3. If I stop the client FD usage falls immediately to 131 (31 + 100)
  4. If I wait 60 seconds (2x the 30 second channel sweep period), FD usage falls back to 31.
  5. The only TCP sockets shown by lsof -p <pid> -a -d fd are in the LISTEN state.

I have tried a couple of variations on this, eg. short-lived vs. long-lived clients, monitor vs. get, and a couple of clients with the same or overlapping PV lists. So far I see the expected FD usage of 31 + #servers + #clients, allowing for the 60 second garbage collection period for unused upstream channels.
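For tracking this over time, a small watcher polling /proc works well (Linux only; the script below is just a convenience, not part of p4p):

# Periodically log the number of open FDs for a process.
import os
import sys
import time

pid = int(sys.argv[1])  # gateway PID, e.g. from pgrep pvagw
while True:
    try:
        nfds = len(os.listdir('/proc/%d/fd' % pid))
    except FileNotFoundError:
        break  # process has exited
    print(time.strftime('%H:%M:%S'), 'open FDs:', nfds)
    time.sleep(10)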

@mdavidsaver
Member

  • If one requests a large number, the gateway generally becomes unresponsive for a time

... allowing for the 60 second garbage collection period for unused upstream channels.

As I wrote this, I started thinking. Currently, there is no absolute bound on the number of channels in the client channel cache, including currently unused channels. Having the cache fill up with unused channels might explain the "... for a time" part, eg. if that interval was 60 -> 90 seconds.

Entries in the channel cache should only have a TCP connection if some down-stream server gave a positive search response, and that server is still up. So simply having up-stream clients searching for non-existent PV names does not cause extra FD usage.

While I can't quite imagine a specific scenario, I suppose it would be possible for an IOC in a restart loop to keep an FD in use until timeout or GC sweep. Although, in the case of soft IOCs, I would expect such connections to be RST before the GC happens.

And of course there is always the "problem" of short-lived down-stream clients. Launching 10,000 instances of pvmonitor going through a gateway will consume at least 10,000 sockets on the gateway host regardless of how many downstream IOCs would be involved.

(adding a gateway-configurable limit on the number of upstream clients, in total and/or per host, might be an interesting project)
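Purely as an illustration of that idea (nothing like this exists in p4p today; all names are hypothetical), such a limit could be a per-host counter consulted when a new client connection is accepted:

# Illustrative only: a cap on client connections, per host and in total.
import collections

class ClientLimiter:
    def __init__(self, per_host=64, total=1024):
        self.per_host = per_host
        self.total = total
        self.counts = collections.Counter()

    def try_accept(self, host):
        """Return True if a new connection from `host` may be accepted."""
        if sum(self.counts.values()) >= self.total:
            return False
        if self.counts[host] >= self.per_host:
            return False
        self.counts[host] += 1
        return True

    def release(self, host):
        """Call when a connection from `host` closes."""
        self.counts[host] -= 1
        if self.counts[host] <= 0:
            del self.counts[host]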

In short, I need some more data to generate ideas of where to look next (metrics, error messages, ...).

@simon-ess
Author

Sorry if I wasn't clear: we resolved the issue we observed by increasing the number of file descriptors, I believe to 2048. What we are hoping to improve is the behaviour when that limit is breached.

The reason this is relevant is that a user had written a script to fetch a lot of PVs through a gateway, causing it to crash completely. What we would like is better behaviour if someone--intentionally or otherwise--performs a DoS attack on the gateway.

In my test setup I have a gateway with LimitNOFILE set to a maximum of 27; this allows two PVs to be fetched through the gateway before everything falls down. If I make a large number of requests, the relevant part of the lsof output is

pvagw   13976 pvagw    0r      CHR                1,3       0t0      5370 /dev/null
pvagw   13976 pvagw    1u     unix 0xffff8ffcd6654000       0t0     82800 socket
pvagw   13976 pvagw    2u     unix 0xffff8ffcd6654000       0t0     82800 socket
pvagw   13976 pvagw    3r      CHR                1,9       0t0      5375 /dev/urandom
pvagw   13976 pvagw    4u      REG              253,0     28672  38739152 /tmp/tmpamnypqmm
pvagw   13976 pvagw    5u  a_inode               0,10         0      5366 [eventpoll]
pvagw   13976 pvagw    6r     FIFO                0,9       0t0     82830 pipe
pvagw   13976 pvagw    7w     FIFO                0,9       0t0     82830 pipe
pvagw   13976 pvagw    8u  a_inode               0,10         0      5366 [eventfd]
pvagw   13976 pvagw    9u     IPv4              82831       0t0       UDP *:59399 
pvagw   13976 pvagw   10u     IPv6              82832       0t0       UDP *:59399 
pvagw   13976 pvagw   11u  a_inode               0,10         0      5366 [eventpoll]
pvagw   13976 pvagw   12r     FIFO                0,9       0t0     82833 pipe
pvagw   13976 pvagw   13w     FIFO                0,9       0t0     82833 pipe
pvagw   13976 pvagw   14u  a_inode               0,10         0      5366 [eventfd]
pvagw   13976 pvagw   15u     IPv4              82836       0t0       UDP *:5076 
pvagw   13976 pvagw   16u     IPv6              82837       0t0       UDP *:60711 
pvagw   13976 pvagw   17u  a_inode               0,10         0      5366 [eventpoll]
pvagw   13976 pvagw   18r     FIFO                0,9       0t0     82838 pipe
pvagw   13976 pvagw   19w     FIFO                0,9       0t0     82838 pipe
pvagw   13976 pvagw   20u  a_inode               0,10         0      5366 [eventfd]
pvagw   13976 pvagw   21u     IPv4              82839       0t0       UDP *:36873 
pvagw   13976 pvagw   22u     sock                0,7       0t0     82840 protocol: UDPv6
pvagw   13976 pvagw   23u     IPv4           72286672       0t0       TCP one.one.one.one:39190->security.cloudflare-dns.com:53816 (ESTABLISHED)
pvagw   13976 pvagw   24u     IPv4              82844       0t0       TCP one.one.one.one:pvaccess (LISTEN)
pvagw   13976 pvagw   25u     IPv4           72286673       0t0       TCP one.one.one.one:39760->security.cloudflare-dns.com:60127 (ESTABLISHED)
pvagw   13976 pvagw   26u     IPv4           72286674       0t0       TCP one.one.one.one:51904->security.cloudflare-dns.com:43035 (ESTABLISHED)

From what I can see, it is the last ESTABLISHED lines that are the new ones; everything up to and including fd 22u is present before the request.

Attached are two sets of logs. The first one is from a request for 25 PVs through the gateway; it has been somewhat cleaned up, as it is extremely verbose. As near as I can tell, all of the stack traces are the same.

The second one is from a successful request of a single PV through the gateway.

pva-gw-crash-short.log
pva-gw-ok.log
