Merge pull request #284 from Nitrokey/nethsm-pkcs11-retries

nethsm-pkcs11: add section for network reliability features
Nitrokey · Sep 19, 2024 · 9dbb8a4
2 parents 386f166 + 05262df
commit 9dbb8a4
Showing 1 changed file with 28 additions and 1 deletion.
diff --git a/nethsm/pkcs11-setup.rst b/nethsm/pkcs11-setup.rst
@@ -135,14 +135,41 @@ If multiple NetHSM instances are listed in the same slot, these instances must b
 The module will use the instances in a round-robin fashion, trying another instance if one fails.
 
 
+Network reliability
+~~~~~~~~~~~~~~~~~~~
+
+To improve the reliability of the PKCS#11 module, you can configure timeouts, retries, instance redundancy, and TCP keepalives.
+
+Retries
+^^^^^^^
+
+If a NetHSM instance is unreachable, the PKCS#11 module can retry the request on other instances, or on the same instance if the other instances are also unreachable.
+A delay between retries can be configured.
+
+- Failing instances are marked as unreachable and retried in a background thread, so they won't be tried for regular requests unless all instances are unreachable.
+- If no background thread can be spawned (``CKF_LIBRARY_CANT_CREATE_OS_THREADS``), failed instances will be tried during normal operations, slowing down the requests. To minimise this, such "inline" health checks are limited to a 1-second timeout each, and at most 3 health checks can be attempted per request (this is a worst-case situation that can only be reached if a large number of instances have failed).
+
+Therefore:
+
+- The maximum number of requests sent for one API call is: ``retries.count`` + 1 + 3
+- The maximum (worst-case) duration before reaching the timeout for one API call is: (``retries.count`` + 1) * ``timeout_seconds`` + 3 seconds
+- The maximum timeout for one PKCS#11 function call will vary, because some functions result in multiple API calls to the NetHSM.
+
+TCP keepalive
+^^^^^^^^^^^^^
+
+To improve performance, connections to the NetHSM instances are kept open so that they do not need to be re-opened.
+In a network with a firewall, these idle connections could be closed, causing the next connection attempt to time out.
+To prevent such slow timeouts, and to detect closed connections earlier, TCP keepalives can be configured for these connections.
+
 Users
 ~~~~~
 
 The operator and administrator users are both optional but the module won't start if no user is configured. This is so you can configure the module with only an administrator user, only an operator user or both at the same time.
 
 When the two users are set the module will use the operator by default and only use the administrator user when the action needs it.
 
-The regular PKCS11 user is mapped to the NetHSM operator and the PKCS11 SO is mapped to the NetHSM administrator.
+The regular PKCS#11 user is mapped to the NetHSM operator and the PKCS#11 SO is mapped to the NetHSM administrator.
 
 Passwords
 ~~~~~~~~~
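
Review note on the "Retries" section added in this diff: the two worst-case bounds can be checked with a small calculation. The sketch below is only an illustration of the formulas as written in the new text; the function name and the sample values (3 retries, 10 second timeout) are assumptions and not part of the module or its configuration. It also follows the text in leaving out any configured delay between retries.

```python
def worst_case_bounds(retries_count: int, timeout_seconds: int) -> tuple[int, int]:
    """Return (max_requests, max_duration_seconds) for one API call.

    Hypothetical helper that mirrors the formulas in the Retries section.
    """
    inline_health_checks = 3   # at most 3 inline health checks per request
    health_check_timeout = 1   # each inline health check is limited to 1 second
    attempts = retries_count + 1  # the initial attempt plus the retries

    max_requests = attempts + inline_health_checks
    max_duration = attempts * timeout_seconds + inline_health_checks * health_check_timeout
    return max_requests, max_duration


# Sample values chosen for illustration only.
print(worst_case_bounds(retries_count=3, timeout_seconds=10))  # (7, 43)
```

With these sample values, one API call sends at most 7 requests and takes at most 43 seconds before timing out. A single PKCS#11 function call can take longer, since it may issue several API calls.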