From f3e587be0bacf8a05d4df8e1d69894c3f82e4d5e Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Sosth=C3=A8ne=20Gu=C3=A9don?=
Date: Tue, 17 Sep 2024 11:59:02 +0200
Subject: [PATCH 1/3] nethsm-pkcs11: add section for network reliability features

---
 nethsm/pkcs11-setup.rst | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/nethsm/pkcs11-setup.rst b/nethsm/pkcs11-setup.rst
index ca11b4cd0e..08caed007d 100644
--- a/nethsm/pkcs11-setup.rst
+++ b/nethsm/pkcs11-setup.rst
@@ -135,6 +135,30 @@ If multiple NetHSM instances are listed in the same slot, these instances must b
 
 The module will use the instances in a round-robin fashion, trying another instance if one fails.
 
+Network reliability
+~~~~~~~~~~~~~~~~~~~
+
+To improve the reliability of the pkcs11 module, it is possible to configure timeouts, retries, instance redundancy and TCP keepalives.
+
+Retries
+^^^^^^^
+
+If a NetHSM instance is unreachable, the pkcs11 module is capable of retrying sending the request to other instances, or to the same instance (if other instances are also unreachable).
+It is possible to introduce a delay between retries.
+
+- Failing Instances are marked as unreachable and retried in a background thread, so they won't be tried unless all instances are unreachable
+- If no background thread can be spawned (`CKF_LIBRARY_CANT_CREATE_OS_THREADS`), failed instances will be tried during normal operations, slowing down the requests. To minimise this, such "inline" health checks are limited to 1 seconds timeouts, and only 3 health checks can be attempted per request (this is a worst case situation that can only be reached if a large numbe of instances are failed).
+- The total number of requests is: ``retries.count`` + 1
+- The total timeout for 1 request attempt is: (``retries.count`` + 1) * ``timeout_seconds`` + 3
+- The total timeout for 1 PKCS11 function call will vary because some functions will lead to multiple API calls in the nethsm.
+
+TCP keepalive
+^^^^^^^^^^^^^
+
+To improve performance, connections are kept open with the NetHSM instances to avoid the need for re-opening them.
+It is possible that in a network with a firewall, these idle connection could be closed, leading to the next connection attempt to timeout.
+To prevent slow timeouts from happening, and to detect earlier if it does, it is possible to configure TCP keepalives for these.
+
 Users
 ~~~~~
 
From 57af7eb6c2d7399f964415f488a5d3e9bed5fcdb Mon Sep 17 00:00:00 2001
From: jans23
Date: Tue, 17 Sep 2024 12:21:46 +0200
Subject: [PATCH 2/3] typos

---
 nethsm/pkcs11-setup.rst | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/nethsm/pkcs11-setup.rst b/nethsm/pkcs11-setup.rst
index 08caed007d..10808a1ed1 100644
--- a/nethsm/pkcs11-setup.rst
+++ b/nethsm/pkcs11-setup.rst
@@ -138,19 +138,19 @@ The module will use the instances in a round-robin fashion, trying another insta
 Network reliability
 ~~~~~~~~~~~~~~~~~~~
 
-To improve the reliability of the pkcs11 module, it is possible to configure timeouts, retries, instance redundancy and TCP keepalives.
+To improve the reliability of the PKCS#11 module, it is possible to configure timeouts, retries, instance redundancy and TCP keepalives.
 
 Retries
 ^^^^^^^
 
-If a NetHSM instance is unreachable, the pkcs11 module is capable of retrying sending the request to other instances, or to the same instance (if other instances are also unreachable).
+If a NetHSM instance is unreachable, the PKCS#11 module is capable of retrying sending the request to other instances, or to the same instance (if other instances are also unreachable).
 It is possible to introduce a delay between retries.
 
-- Failing Instances are marked as unreachable and retried in a background thread, so they won't be tried unless all instances are unreachable
-- If no background thread can be spawned (`CKF_LIBRARY_CANT_CREATE_OS_THREADS`), failed instances will be tried during normal operations, slowing down the requests. To minimise this, such "inline" health checks are limited to 1 seconds timeouts, and only 3 health checks can be attempted per request (this is a worst case situation that can only be reached if a large numbe of instances are failed).
+- Failing instances are marked as unreachable and retried in a background thread, so they won't be tried unless all instances are unreachable
+- If no background thread can be spawned (`CKF_LIBRARY_CANT_CREATE_OS_THREADS`), failed instances will be tried during normal operations, slowing down the requests. To minimise this, such "inline" health checks are limited to 1 second timeouts, and only 3 health checks can be attempted per request (this is a worst case situation that can only be reached if a large number of instances failed).
 - The total number of requests is: ``retries.count`` + 1
 - The total timeout for 1 request attempt is: (``retries.count`` + 1) * ``timeout_seconds`` + 3
-- The total timeout for 1 PKCS11 function call will vary because some functions will lead to multiple API calls in the nethsm.
+- The total timeout for 1 PKCS#11 function call will vary because some functions will lead to multiple API calls in the NetHSM.
 
 TCP keepalive
 ^^^^^^^^^^^^^
@@ -166,7 +166,7 @@ The operator and administrator users are both optional but the module won't star
 
 When the two users are set the module will use the operator by default and only use the administrator user when the action needs it.
 
-The regular PKCS11 user is mapped to the NetHSM operator and the PKCS11 SO is mapped to the NetHSM administrator.
+The regular PKCS#11 user is mapped to the NetHSM operator and the PKCS#11 SO is mapped to the NetHSM administrator.
 
 Passwords
 ~~~~~~~~~

From 05262df18b6ebb3882302c5d6bd611768d432d2c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Sosth=C3=A8ne=20Gu=C3=A9don?=
Date: Tue, 17 Sep 2024 14:52:54 +0200
Subject: [PATCH 3/3] Clarify description of retries

---
 nethsm/pkcs11-setup.rst | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/nethsm/pkcs11-setup.rst b/nethsm/pkcs11-setup.rst
index 10808a1ed1..0e90ef5d1d 100644
--- a/nethsm/pkcs11-setup.rst
+++ b/nethsm/pkcs11-setup.rst
@@ -148,9 +148,12 @@ It is possible to introduce a delay between retries.
 
 - Failing instances are marked as unreachable and retried in a background thread, so they won't be tried unless all instances are unreachable
 - If no background thread can be spawned (`CKF_LIBRARY_CANT_CREATE_OS_THREADS`), failed instances will be tried during normal operations, slowing down the requests. To minimise this, such "inline" health checks are limited to 1 second timeouts, and only 3 health checks can be attempted per request (this is a worst case situation that can only be reached if a large number of instances failed).
-- The total number of requests is: ``retries.count`` + 1
-- The total timeout for 1 request attempt is: (``retries.count`` + 1) * ``timeout_seconds`` + 3
-- The total timeout for 1 PKCS#11 function call will vary because some functions will lead to multiple API calls in the NetHSM.
+
+Therefore:
+
+- The maximum number of requests sent for one API call is: ``retries.count`` + 1 + 3
+- The maximum (worst case) duration before reaching the timeout for one API call is: (``retries.count`` + 1) * ``timeout_seconds`` + 3
+- The maximum timeout for one PKCS#11 function call will vary because some functions will lead to multiple API calls in the NetHSM.
 
 TCP keepalive
 ^^^^^^^^^^^^^
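
Review note: the bounds introduced in PATCH 3/3 can be checked with a small worked example. The sketch below is not part of the module; the function and parameter names are hypothetical, and it simply evaluates the two formulas as the patch states them, reading the trailing "+ 3" as at most 3 inline health checks of at most 1 second each. It assumes the configured delay between retries is not counted, since the patch text leaves it out of the formula.

```python
# Hypothetical helpers that evaluate the worst-case bounds quoted in PATCH 3/3.
# The names are illustrative only; they are not part of the nethsm-pkcs11 module.

MAX_INLINE_HEALTH_CHECKS = 3       # "only 3 health checks can be attempted per request"
HEALTH_CHECK_TIMEOUT_SECONDS = 1   # "limited to 1 second timeouts"


def max_requests_per_api_call(retries_count: int) -> int:
    """Upper bound on requests sent for one API call: retries.count + 1 + 3."""
    return retries_count + 1 + MAX_INLINE_HEALTH_CHECKS


def max_seconds_per_api_call(retries_count: int, timeout_seconds: int) -> int:
    """Upper bound on time before one API call times out:
    (retries.count + 1) * timeout_seconds + 3.
    Assumption: any delay between retries is excluded, as in the patch text."""
    inline_checks = MAX_INLINE_HEALTH_CHECKS * HEALTH_CHECK_TIMEOUT_SECONDS
    return (retries_count + 1) * timeout_seconds + inline_checks


if __name__ == "__main__":
    # Example values chosen for illustration: retries.count = 3, timeout_seconds = 10.
    print(max_requests_per_api_call(3))      # 3 + 1 + 3 = 7
    print(max_seconds_per_api_call(3, 10))   # (3 + 1) * 10 + 3 = 43
```

With ``retries.count`` = 3 and ``timeout_seconds`` = 10, one API call sends at most 7 requests and takes at most 43 seconds. A single PKCS#11 function call may issue several API calls, so its bound is a multiple of this figure.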