In the Derecho 2.0 release candidate, there can be a significant delay when forming the first view, and the delay grows with group size. This seems to be because Derecho's RDMC layer uses TCP connections as helpers. Apparently, on Linux systems used as general-purpose servers, the rapid establishment of a large number of TCP connections triggers a Linux mechanism intended to protect against TCP SYN flood attacks. In contrast, in our experiments on HPC platforms we have not observed any issue. As far as we can tell, HPC Linux systems seem to be configured with the TCP SYN flood protection mechanism disabled.
Background: In Derecho, there is at least 1 TCP connection between each pair of members in a top-level group. During startup, there is also 1 TCP connection from each joining member to the restart leader. When LibFabrics is employed, there will be one more TCP connection per member. (Note that we are talking about connected TCP sessions. Each member also has 5 ports on which it listens, but these do not pose any issue.)
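To make the counts in the Background paragraph concrete, here is a rough sketch that tallies a lower bound on connected TCP sessions for a group of a given size. It assumes exactly one connection per pair of members, one connection from every member other than the restart leader to the leader, and one extra connection per member when LibFabrics is in use. The function name and the returned field names are hypothetical, and the real counts inside Derecho may be higher.

```python
def min_tcp_connections(n_members: int, uses_libfabric: bool = False) -> dict:
    """Lower-bound tally of connected TCP sessions in one top-level group.

    Assumptions (not taken from the Derecho source code):
      * exactly one connection per unordered pair of members,
      * every member except the restart leader opens one connection to it,
      * LibFabrics adds exactly one connection per member.
    """
    if n_members < 1:
        raise ValueError("a group needs at least one member")

    # One connection for each unordered pair of members: N * (N - 1) / 2.
    pairwise = n_members * (n_members - 1) // 2
    # Each joining member connects to the restart leader during startup.
    to_leader = n_members - 1
    # One additional connection per member when LibFabrics is employed.
    fabric = n_members if uses_libfabric else 0

    return {
        "pairwise": pairwise,
        "to_leader": to_leader,
        "libfabric": fabric,
        "total": pairwise + to_leader + fabric,
    }


if __name__ == "__main__":
    for n in (4, 15):
        print(n, min_tcp_connections(n, uses_libfabric=True))
```

Under these assumptions the pairwise term grows quadratically with group size, so the number of connections opened in a short burst at startup rises quickly as the group grows.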
The role of all of these TCP connections is to assist a joining member in obtaining its initial state, and then to act as helpers when setting up RDMA connections: the Derecho RDMC layer was originally designed to create these on its own over InfiniBand Verbs (IBV). In this IBV configuration, we require a TCP connection over which we exchange RDMA qpair keys in order to bind qpairs into two-sided RDMA sessions.
With LibFabrics, RDMC apparently still creates its own TCP connections, and then LibFabrics creates its own helper TCP connections at the kernel level.
At any rate, the effect is that with N members in the top-level group, Derecho forms 2N TCP connections (3N in the case of the leader). Unfortunately, this resembles a TCP SYN flood attack and makes the initial Derecho startup slow. In our experiments, no issue arises with 4-member groups, but when running on LibFabrics, a group of 15 members needs 4 seconds to start up. We believe that on platforms where this effect arises, it would introduce a slowdown roughly linear in the group size.
To correct this problem, we plan to investigate options for disabling the Linux TCP SYN-flood protection mechanism for Derecho applications, at least for the specific TCP connections we use as RDMA helpers. We also plan to eliminate any unused or duplicative TCP connections.
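As a starting point for that investigation, the sketch below reads a few standard Linux sysctl values related to SYN handling, so that a slow general-purpose server can be compared against an HPC node. The three settings (`net.ipv4.tcp_syncookies`, `net.ipv4.tcp_max_syn_backlog`, `net.core.somaxconn`) are real Linux tunables, but whether any of them is responsible for the delay described here is an assumption; this issue has not identified the exact mechanism. The function name `read_syn_settings` is hypothetical.

```python
from pathlib import Path

# Standard Linux sysctl entries related to SYN handling and listen backlogs.
# Assumption: these are plausible candidates, not a confirmed root cause.
SYSCTLS = {
    "tcp_syncookies": "net/ipv4/tcp_syncookies",
    "tcp_max_syn_backlog": "net/ipv4/tcp_max_syn_backlog",
    "somaxconn": "net/core/somaxconn",
}


def read_syn_settings(proc_root: str = "/proc/sys") -> dict:
    """Return the current value of each setting, or None if it is unreadable."""
    settings = {}
    for name, relative_path in SYSCTLS.items():
        path = Path(proc_root) / relative_path
        try:
            settings[name] = int(path.read_text().strip())
        except (OSError, ValueError):
            # Missing file (non-Linux host), no permission, or unexpected format.
            settings[name] = None
    return settings


if __name__ == "__main__":
    for name, value in read_syn_settings().items():
        print(f"{name} = {value}")
```

Running this on both kinds of machine and comparing the output is a read-only check; it changes no system settings.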