Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDDS-10891. Support Ozone to run ratis group-related command #6716

Draft
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

DaveTeng0
Copy link
Contributor

@DaveTeng0 DaveTeng0 commented May 22, 2024

What changes were proposed in this pull request?

Support Ozone to run ratis group-related command

(I currently put this PR as draft because this is pending on another PR for RATIS-2095. (Move common logic of ratis-shell to RaftUtils so that Ozone shell could share and use common logic) to be merged first. Then new methods created in RATIS-2095 would be used here.)

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10891

How was this patch tested?

unit test

@DaveTeng0
Copy link
Contributor Author

cc. @szetszwo

@DaveTeng0
Copy link
Contributor Author

DaveTeng0 commented May 22, 2024

Hello @fapifta @ChenSammi , I know you guys have been working on ozone's TLS certificate infrastructure for a while, so I'd like to hear from you about the current way I retrieve client side's certificate when Ozone is GRPC-enabled. I currently use a
certClient. getObjectStore() .getClientProxy() .getOzoneManagerClient() .getServiceInfo();
in BaseRatisCommand#createGrpcTlsConf to get the serviceInfo from ozone.
Then I use that serviceInfo to get certificate
CACertificateProvider remoteCAProvider = serviceInfoEx::provideCACerts;
which would be used by the raft client in ozone-shell to connect to all raft servers.
Feel free to let me know if any comment, like other better way to retrieve the certificate, etc. thanks!

Copy link
Contributor

@fapifta fapifta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @DaveTeng0,
thank you for working on this improvement, please find some comments inline related to ClientTrustManager and how trust is build up in Ozone partially.

.getOzoneManagerClient()
.getServiceInfo();

CACertificateProvider remoteCAProvider =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if this remoteProvider definition is intentional or is a misunderstanding of the ClientTrustManager's intents, but let me provide the intent based on which I hope you can decide, and course correct if needed.

The ClientTrustManager is designed in this way to ensure we can give the long running clients a possibility to renew their knowledge of the Ozone cluster's internal rootCA certificate that the client side uses to verify the identity of DataNodes in case of a rootCA rotation that happens some time before the rootCA actually expires, and is a long process by default during which there might be multiple rootCA certificates playing the role of the root of trust (the old one and the new one), after which period the new rootCA becomes the only valid one (but that does not affect the client side which can run further with the two together at the moment).

The current way of initializing the ClientTrustManager here, leads to a state where this renew does not happen, as both the remote and the in-memory provider is the same (the remote basically delegates to the in-memory instance via a lambda expression). This way of operation in the ClientTrustManager can be achieved by specifying the remote provider to null, as in both cases no additional RPC call would be done even if there is verification problem with the certificate (the new DN cert is signed by a different CA).

If this is a short running client which it is (as it runs one or more RPC calls in a short period of time and exits). If we can assume that these RPC calls are idempotent, then we don't need the remote provider, as in case of a rare event when two or more RPCs fall to the different sides on a rootCA rotation in time, then the whole thing can be done once more and it will not fail with a certificate issue the second time.

If this would be a long running client that issues multiple RPCs over a longer period of time, then it does worth it to provide a remote provider other than null, and in this case the best a remote provider can do to refresh the root of trust is to call OM's getServiceInfo method via an RPC, and provide the root(s) of trust based on the result of that RPC call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @fapifta for the detailed explanation and recommendations regarding the ClientTrustManager and the remoteProvider configuration. In my case, I am indeed dealing with short-running clients that perform one RPC calls and then exit.
Based on your guidance, setting remoteProvider to null is an acceptable approach for my use case. But I'm open to any other suggestions or alternative approaches if other folks might have.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't calculated it exactly but if my memory serves, by default anything under 2 hours of runtime is safe without a remote provider, and in any reasonable setup problems will not arise with a client that does live for just a few seconds or even a few minutes, so yes, in this use case null should be completely fine.

serviceInfoEx::provideCACerts;
ClientTrustManager trustManager = new ClientTrustManager(remoteCAProvider, serviceInfoEx);

return RatisHelper.createTlsClientConfig(new SecurityConfig(config), trustManager);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point we also have a potential problem...

As I understand this PR, the goal is to provide CLI commands that an admin can run and via which the admin can issue commands to a Ratis Group directly from the CLI.

If this understanding of mine is correct, then we may run into a problem with mutual TLS authentication that happens between the raft peers, because the createTlsClientConfig call creates a client side config that does not apply mTLS because the client does not have its own certificate that our internal PKI system trusts, and also it does not have a way to acquire such a certificate from the internal PKI system.

So unless the client here talks to a ratis client port that does not require mTLS but uses other authentication mechanisms like Kerberos, then this approach ain't gonna fly and honestly in this case we need to figure out if we can and how we can allow the client to get its own certificate to use mTLS with the raft server side. (Also we might need to offload this from SCM if these calls are something clients would require potentially often, as we can not afford to generate a lot of certificate on the SCM side as it is a costly operation.) Not to mention the complexity of ensuring that only valid users can grab a cert from the system.
OTOH, if these raft servers do not use mTLS then how they authenticate and identify each others?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think currently the client side does not require mTLS authentication (per comment in https://github.com/apache/ozone/blob/master/hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/ratis/RatisHelper.java#L421-L422). But I could be wrong, does any other folk have additional insights or recommendations regarding the authentication and security aspects of this problem? (cc. @szetszwo @ChenSammi @Galsza

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will let @szetszwo to chime in on this one... if we need mTLS the problem is there, if the API itself does not require authentication, then we should be good to go.

However, I think, if there is no authentication then we need to be sure that there are no administrative commands that can be triggered via the Ratis client and we also need to be sure that there is no sensitive information that are emitted in the response.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants