Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

node: add amazon kms and benchmark signers #4168

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

pleasew8t
Copy link
Contributor

Amazon KMS Guardian Signer

This PR adds an Amazon AWS KMS Guardian signer, allowing observations to be signed using KMS! The new signer can be used by specifying the ARN of the KMS key to use, through the --guardianSignerUri commandline argument, as follows:

  • --guardianSignerUri=amazonkms://<ARN>

NOTE For the best possible performance, it is recommended that the Guardian be run from an EC2 instance that is in the same region as the KMS key.

The KMS key's spec should be ECC_SECG_P256K1, and should be enabled for signing. In order for the Guardian to authenticate against the KMS service, one of two options are available:

  • Create new API keys in the AWS console that are permissioned to use the KMS key for signing, and add the keys to the EC2 instance's ~/.aws/credentials file. (example here).
  • Create a role that is permissioned to use the KMS key and attach that role to the Guardian EC2 instance.

Benchmark Signer

The PR also includes a benchmark signer, which wraps any other signer, logging signing and verification latency to prometheus histograms. External signing services might at times introduce unwanted latency, and if an event occurs where observation processing is particularly slow, the histograms would provide insight into whether or not the signing service is to blame.

NOTE
This is a redo of a previous pull request, which Pires spent time looking at. Below are key points following that review that informs the current state of the code:

  • Contexts should be supplied to signing, verification and public key retrieval, as these functions potentially interact with 3rd party services that could timeout or block indefinitely.
  • The GuardianSigner constructor (NewGuardianSignerFromUri) accepts a Context, as the new signer might need to interact with the 3rd party service to validate configurations (such as the AmazonKms signer). A different approach could be to have the constructor define its own context, to avoid the necessity of passing a context to it.

}
}

func (b *BenchmarkSigner) Sign(ctx context.Context, hash []byte) ([]byte, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will errors skew the data? Should you add a metric for errors and bucket the latency separately or exclude them from the latency numbers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this! I've added error counters.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following on from Bruce's comment, you're now tracking error counts but the latency figures aren't grouped by success/fail cases. Should we only observe the latency for a successful operation?

bruce-riley
bruce-riley previously approved these changes Nov 27, 2024
Copy link
Collaborator

@pires pires left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been involved in reviewing this before this PR was open so I want to signal I think this is in a good shape to be seriously considered for merge by other more seasoned maintainers.
That said, @andreclaro and xLabs Nodes team will be soaking this in testnet and report back 🙏🏻

}
}

func (b *BenchmarkSigner) Sign(ctx context.Context, hash []byte) ([]byte, error) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following on from Bruce's comment, you're now tracking error counts but the latency figures aren't grouped by success/fail cases. Should we only observe the latency for a successful operation?

// an error. This is why the region is first extracted from the keyPath.
cfg, err := config.LoadDefaultConfig(timeoutCtx, config.WithDefaultRegion(amazonKmsSigner.region))
if err != nil {
return nil, errors.New("Failed to load default config")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this error be more descriptive re being related to KMS? i.e. should every error string in this signer include the string "KMS"?


ecdsaPubkey := ecdsa.PublicKey{
X: new(big.Int).SetBytes(asn1Pubkey.PublicKey.Bytes[1 : 1+32]),
Y: new(big.Int).SetBytes(asn1Pubkey.PublicKey.Bytes[1+32:]),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment to explain the byte array indexes?

}

func (a *AmazonKms) Verify(ctx context.Context, sig []byte, hash []byte) (bool, error) {
timeoutCtx, cancel := context.WithTimeout(ctx, time.Second*15)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use KMS_TIMEOUT constant here

}

// adjustBufferSize takes an input buffer and
// a) trims it down to 32 bytes, if the input length is greater than 32, or
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I read "trim it down to 32 bytes" I would guess the least significant bytes are trimmed, but this trims from the front of the array. Maybe worth making this clearer in the comment?

@@ -236,13 +236,15 @@ func getPublicRPCServiceClient(ctx context.Context, addr string) (*grpc.ClientCo
}

func runSignWormchainValidatorAddress(cmd *cobra.Command, args []string) error {
ctx := context.Background()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and elsewhere in some other source files (e.g. message.go) you create new contexts without managing their lifecycle...what's the reasoning for this? Are we concerned about resource leaks?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants