Adding support for asynchronous signing #2553

waterson · 2023-09-05T19:32:05Z

Motivation

We run an LDK-based multi-tenant container; i.e., a single process that runs several independent Lightning nodes concurrently. For some of these nodes, we'd like to delegate the signing operations to a remote service.

A remote signing service could be a secure enclave running in the same datacenter, a service operated by a third-party that maintains the signing keys, or even an individual device.

In each of these cases, we'd make a remote request to the signing service to perform the actual signing operation, or to provide LDK with a necessary secret (e.g., the per-commitment secret).

For this to work, each of the operations that LDK might perform must be fallible; i.e., it should be allowed to transiently fail, and then be resumed later when the result becomes available.

Currently (as of 0.0.116), LDK's signing interfaces (e.g., ChannelSigner) are not fallible in this sense:

- Some operations simply do not admit failure at all; i.e., they do not return a Result type. For example, get_per_commitment_point returns PublicKey and admits no possibility that the commitment point is not immediately available.
- Some operations do return a Result type, but are fallible "in signature only": actually returning an error will crash LDK.
- Some operations return a Result type, but any error result will cause an immediate channel force-closure.

We'd like to work through the various signing interfaces and improve LDK's implementation to support the above use case. In particular, each method should admit an implementation that may not immediately have a result but has not failed permanently.

As a motivating example, consider an implementation of the ChannelSigner interface implemented using webhooks. In a canonical webhook-based design, a request is sent via HTTP POST to a remote server. The response is typically a short 200 OK, followed later by the remote server issuing an HTTP POST back to the requester with the results.

Proof-of-concept

A proof-of-concept implementation is in progress in #2487, which specifically addresses ChannelSigner::get_per_commitment_point and ChannelSigner::release_commitment_secret. Our goal here is to explore how we might rework LDK's internals to support 1) these methods returning a Result type (so that they can fail), and 2) resuming the channel state machine appropriately when a result is returned.

After some initial prototyping and discussion, we opted for the following approach:

- Allow both get_per_commitment_point and release_commitment_secret to return an error that is simply the unit type, (). Current in-memory implementations of the signer never return an error, so they need to change only insofar as they wrap their results in Ok(...).
- Interpret the Err result to be user-defined as follows. If the signing failure is permanent, then the user must handle force-closing the channel themselves after returning the Err result. On the other hand, if the signing failure is temporary (e.g., requires a response from a remote party), then the user can explicitly retry the operation when the results are available.
- When a channel operation calls get_per_commitment_point or release_commitment_secret and receives an Err result, it unwinds and stores a retry state associated with the channel in the per-peer state.
- This state can be activated through a new ChannelManager method, retry_channel. This method accepts the remote peer's public key and the channel ID, and restarts the operation that previously failed. The assumption is that the request to get_per_commitment_point or release_commitment_secret will now succeed because the signer implementation will have the required material.
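The signer side of this approach can be sketched roughly as follows. This is a self-contained illustration, not LDK's actual trait: `FallibleSigner`, `LocalSigner`, `RemoteSigner`, the `deliver_*` methods, and the placeholder `Point` and `Secret` types are all names invented for the example. Only the two method names and the `Result<_, ()>` return type come from the proposal above.

```rust
use std::collections::HashMap;

// Placeholder types standing in for a real public key and a 32-byte secret (assumption).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Point(pub [u8; 33]);
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Secret(pub [u8; 32]);

/// Hypothetical trait mirroring the two methods discussed above.
/// Err(()) means either "not ready yet" or "permanently failed"; the user decides which.
pub trait FallibleSigner {
    fn get_per_commitment_point(&self, idx: u64) -> Result<Point, ()>;
    fn release_commitment_secret(&self, idx: u64) -> Result<Secret, ()>;
}

/// In-memory signer: always has the material, so it only wraps results in Ok(...).
pub struct LocalSigner;

impl FallibleSigner for LocalSigner {
    fn get_per_commitment_point(&self, idx: u64) -> Result<Point, ()> {
        Ok(Point([idx as u8; 33])) // dummy value, not a real key derivation
    }
    fn release_commitment_secret(&self, idx: u64) -> Result<Secret, ()> {
        Ok(Secret([idx as u8; 32])) // dummy value, not a real secret
    }
}

/// Remote-backed signer: answers only from material the remote service has delivered.
#[derive(Default)]
pub struct RemoteSigner {
    points: HashMap<u64, Point>,
    secrets: HashMap<u64, Secret>,
}

impl RemoteSigner {
    /// Called when the remote service posts a result back (e.g. a webhook callback).
    pub fn deliver_point(&mut self, idx: u64, point: Point) {
        self.points.insert(idx, point);
    }
    pub fn deliver_secret(&mut self, idx: u64, secret: Secret) {
        self.secrets.insert(idx, secret);
    }
}

impl FallibleSigner for RemoteSigner {
    fn get_per_commitment_point(&self, idx: u64) -> Result<Point, ()> {
        // Err(()) until the remote has delivered this point.
        self.points.get(&idx).copied().ok_or(())
    }
    fn release_commitment_secret(&self, idx: u64) -> Result<Secret, ()> {
        self.secrets.get(&idx).copied().ok_or(())
    }
}
```

In this sketch the remote-backed signer never blocks: a call either finds the delivered material or returns `Err(())`, leaving it to the caller to retry later.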

As an example, consider the following (somewhat simplified) flow that occurs during commitment_signed:

Here the WebhookSignerImpl is an implementation of the ChannelSigner interface provided by a user, and SigningService is the service to which that implementation delegates the signing operations.

Upon receiving an error response from the signer, the commitment_signed handler in the Channel propagates the error out to the ChannelManager that then notes that the channel is pending retry for commitment_signed. (Specifically, it does so by adding an entry in a new per-peer state table keyed by channel ID whose value is an enum with sufficient side-information to restart the operation.)

Later, when the user's WebhookSignerImpl has been provided with information sufficient to proceed, it invokes retry_channel on the ChannelManager, passing in the peer and channel ID. From this, the ChannelManager can recover the retry state and restart the commitment_signed processing.
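The bookkeeping described above (a per-peer table keyed by channel ID, plus a retry entry point) might look something like the following. This is a simplified sketch under stated assumptions, not the code in #2487: `Manager`, `PeerState`, `RetryState` and its variants, `note_pending`, and `is_pending` are hypothetical names, and the `handler` closure stands in for re-running the real message handler.

```rust
use std::collections::HashMap;

// Placeholder identifiers (assumption): a serialized peer public key and a channel ID.
pub type PeerId = [u8; 33];
pub type ChannelId = [u8; 32];

/// Which operation was interrupted, with enough side-information to restart it.
#[derive(Debug, Clone, PartialEq)]
pub enum RetryState {
    CommitmentSigned { msg: Vec<u8> },
    FundingSigned { msg: Vec<u8> },
}

#[derive(Default)]
pub struct PeerState {
    pending_retries: HashMap<ChannelId, RetryState>,
}

#[derive(Default)]
pub struct Manager {
    per_peer: HashMap<PeerId, PeerState>,
}

impl Manager {
    /// Record that an operation on `chan` hit Err(()) from the signer.
    pub fn note_pending(&mut self, peer: PeerId, chan: ChannelId, state: RetryState) {
        self.per_peer.entry(peer).or_default().pending_retries.insert(chan, state);
    }

    pub fn is_pending(&self, peer: &PeerId, chan: &ChannelId) -> bool {
        self.per_peer.get(peer).map_or(false, |p| p.pending_retries.contains_key(chan))
    }

    /// Sketch of retry_channel: take the stored state and re-run the operation.
    /// Returns Err(()) if nothing is pending or if the handler fails again;
    /// in the latter case the state is put back so a later retry can find it.
    pub fn retry_channel<F>(&mut self, peer: &PeerId, chan: &ChannelId, handler: F) -> Result<(), ()>
    where
        F: FnOnce(&RetryState) -> Result<(), ()>,
    {
        let peer_state = self.per_peer.get_mut(peer).ok_or(())?;
        let state = peer_state.pending_retries.remove(chan).ok_or(())?;
        match handler(&state) {
            Ok(()) => Ok(()),
            Err(()) => {
                peer_state.pending_retries.insert(*chan, state);
                Err(())
            }
        }
    }
}
```

One design point the sketch makes visible: with a unit error type, the caller of `retry_channel` cannot tell "nothing was pending" from "still not ready" without a separate query such as `is_pending`.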

Overview of changes

- Modify get_per_commitment_point and release_commitment_secret to return a Result type. The error type is the unit type, and is interpreted by the caller to mean either a) the information is not ready, and the user will later attempt to resume processing by calling retry_channel, or b) the channel signer has permanently failed and the user will eventually force-close (or abandon) the channel.
- Store the current holder per-commitment point and the previous holder commitment secret as part of a channel's Context. These are both Option values, with None for a newly constructed channel awaiting its first commitment point (or revocation).
- Allow the point and secret to be persisted and restored by adding new entries to the TLV structure used to persist a channel.
- Modify code that uses the per-commitment point or secret to use the cached values rather than assuming that the signer may simply be called at any point.
- Modify the following Channel and ChannelManager handlers that initialize or modify the per-commitment point to cache the values correctly:
  - Channel::funding_signed
  - Channel::commitment_signed
  - Channel::channel_reestablish
  - Channel::funding_created
  - ChannelManager::do_accept_inbound_channel
  - ChannelManager::create_channel (still to do)
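The caching described in this list can be illustrated with a small sketch. The struct and field names below (`ChannelContext`, `cur_holder_commitment_point`, `prev_holder_commitment_secret`) and the `fetch` closures, which stand in for the signer calls, are assumptions made for the example and are not taken from LDK's source.

```rust
// Placeholder types standing in for a real public key and a 32-byte secret (assumption).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Point(pub [u8; 33]);
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Secret(pub [u8; 32]);

/// Hypothetical slice of a channel's context holding the cached values.
/// Both start as None for a newly constructed channel.
#[derive(Default)]
pub struct ChannelContext {
    cur_holder_commitment_point: Option<Point>,
    prev_holder_commitment_secret: Option<Secret>,
}

impl ChannelContext {
    /// Called from a handler that advances the commitment point.
    /// On Err(()) the cache is left untouched so the operation can be retried.
    pub fn refresh_point<F>(&mut self, fetch: F) -> Result<(), ()>
    where
        F: FnOnce() -> Result<Point, ()>,
    {
        let point = fetch()?;
        self.cur_holder_commitment_point = Some(point);
        Ok(())
    }

    /// Same pattern for the previous commitment secret.
    pub fn refresh_secret<F>(&mut self, fetch: F) -> Result<(), ()>
    where
        F: FnOnce() -> Result<Secret, ()>,
    {
        let secret = fetch()?;
        self.prev_holder_commitment_secret = Some(secret);
        Ok(())
    }

    /// Readers consult the cache and never call the signer directly.
    pub fn holder_commitment_point(&self) -> Option<Point> {
        self.cur_holder_commitment_point
    }

    pub fn prev_commitment_secret(&self) -> Option<Secret> {
        self.prev_holder_commitment_secret
    }
}
```

Keeping the signer call confined to the `refresh_*` methods means only the handlers listed above need to deal with `Err(())`; every other reader sees a plain `Option`.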

TheBlueMatt (Maintainer) · 2023-09-06T17:54:30Z

CC #2088

TheBlueMatt (Maintainer) · 2023-09-06T18:01:50Z

We discussed this a bit more on Discord, but I prefer the approach in #2554. In general, we use signing for "generate a message" logic, and most of that logic is already replayable, since we need to replay it if we disconnect from the peer and reconnect to discover that the original message never made it to the peer.
