HDDS-10891. Support Ozone to run ratis group-related command #6716

DaveTeng0 · 2024-05-22T20:47:29Z

What changes were proposed in this pull request?

Support Ozone to run ratis group-related command

(I currently put this PR as draft because this is pending on another PR for RATIS-2095. (Move common logic of ratis-shell to RaftUtils so that Ozone shell could share and use common logic) to be merged first. Then new methods created in RATIS-2095 would be used here.)

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10891

How was this patch tested?

unit test

DaveTeng0 · 2024-05-22T20:56:49Z

cc. @szetszwo

DaveTeng0 · 2024-05-22T21:46:08Z

Hello @fapifta @ChenSammi , I know you guys have been working on ozone's TLS certificate infrastructure for a while, so I'd like to hear from you about the current way I retrieve client side's certificate when Ozone is GRPC-enabled. I currently use a
certClient. getObjectStore() .getClientProxy() .getOzoneManagerClient() .getServiceInfo();
in BaseRatisCommand#createGrpcTlsConf to get the serviceInfo from ozone.
Then I use that serviceInfo to get certificate
CACertificateProvider remoteCAProvider = serviceInfoEx::provideCACerts;
which would be used by the raft client in ozone-shell to connect to all raft servers.
Feel free to let me know if any comment, like other better way to retrieve the certificate, etc. thanks!

fapifta

Hi @DaveTeng0,
thank you for working on this improvement, please find some comments inline related to ClientTrustManager and how trust is build up in Ozone partially.

fapifta · 2024-05-30T23:25:28Z

hadoop-ozone/tools/src/main/java/org/apache/hadoop/ozone/admin/ratis/BaseRatisCommand.java

+        .getOzoneManagerClient()
+        .getServiceInfo();
+
+    CACertificateProvider remoteCAProvider =


I am not sure if this remoteProvider definition is intentional or is a misunderstanding of the ClientTrustManager's intents, but let me provide the intent based on which I hope you can decide, and course correct if needed.

The ClientTrustManager is designed in this way to ensure we can give the long running clients a possibility to renew their knowledge of the Ozone cluster's internal rootCA certificate that the client side uses to verify the identity of DataNodes in case of a rootCA rotation that happens some time before the rootCA actually expires, and is a long process by default during which there might be multiple rootCA certificates playing the role of the root of trust (the old one and the new one), after which period the new rootCA becomes the only valid one (but that does not affect the client side which can run further with the two together at the moment).

The current way of initializing the ClientTrustManager here, leads to a state where this renew does not happen, as both the remote and the in-memory provider is the same (the remote basically delegates to the in-memory instance via a lambda expression). This way of operation in the ClientTrustManager can be achieved by specifying the remote provider to null, as in both cases no additional RPC call would be done even if there is verification problem with the certificate (the new DN cert is signed by a different CA).

If this is a short running client which it is (as it runs one or more RPC calls in a short period of time and exits). If we can assume that these RPC calls are idempotent, then we don't need the remote provider, as in case of a rare event when two or more RPCs fall to the different sides on a rootCA rotation in time, then the whole thing can be done once more and it will not fail with a certificate issue the second time.

If this would be a long running client that issues multiple RPCs over a longer period of time, then it does worth it to provide a remote provider other than null, and in this case the best a remote provider can do to refresh the root of trust is to call OM's getServiceInfo method via an RPC, and provide the root(s) of trust based on the result of that RPC call.

Thanks @fapifta for the detailed explanation and recommendations regarding the ClientTrustManager and the remoteProvider configuration. In my case, I am indeed dealing with short-running clients that perform one RPC calls and then exit.
Based on your guidance, setting remoteProvider to null is an acceptable approach for my use case. But I'm open to any other suggestions or alternative approaches if other folks might have.

I haven't calculated it exactly but if my memory serves, by default anything under 2 hours of runtime is safe without a remote provider, and in any reasonable setup problems will not arise with a client that does live for just a few seconds or even a few minutes, so yes, in this use case null should be completely fine.

fapifta · 2024-05-30T23:37:45Z

hadoop-ozone/tools/src/main/java/org/apache/hadoop/ozone/admin/ratis/BaseRatisCommand.java

+        serviceInfoEx::provideCACerts;
+    ClientTrustManager trustManager = new ClientTrustManager(remoteCAProvider, serviceInfoEx);
+
+    return RatisHelper.createTlsClientConfig(new SecurityConfig(config), trustManager);


At this point we also have a potential problem...

As I understand this PR, the goal is to provide CLI commands that an admin can run and via which the admin can issue commands to a Ratis Group directly from the CLI.

If this understanding of mine is correct, then we may run into a problem with mutual TLS authentication that happens between the raft peers, because the createTlsClientConfig call creates a client side config that does not apply mTLS because the client does not have its own certificate that our internal PKI system trusts, and also it does not have a way to acquire such a certificate from the internal PKI system.

So unless the client here talks to a ratis client port that does not require mTLS but uses other authentication mechanisms like Kerberos, then this approach ain't gonna fly and honestly in this case we need to figure out if we can and how we can allow the client to get its own certificate to use mTLS with the raft server side. (Also we might need to offload this from SCM if these calls are something clients would require potentially often, as we can not afford to generate a lot of certificate on the SCM side as it is a costly operation.) Not to mention the complexity of ensuring that only valid users can grab a cert from the system.
OTOH, if these raft servers do not use mTLS then how they authenticate and identify each others?

I think currently the client side does not require mTLS authentication (per comment in https://github.com/apache/ozone/blob/master/hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/ratis/RatisHelper.java#L421-L422). But I could be wrong, does any other folk have additional insights or recommendations regarding the authentication and security aspects of this problem? (cc. @szetszwo @ChenSammi @Galsza

I will let @szetszwo to chime in on this one... if we need mTLS the problem is there, if the API itself does not require authentication, then we should be good to go.

However, I think, if there is no authentication then we need to be sure that there are no administrative commands that can be triggered via the Ratis client and we also need to be sure that there is no sensitive information that are emitted in the response.

DaveTeng0 added 3 commits May 22, 2024 13:46

HDDS-10891. Support Ozone to run ratis group-related command

0728e45

remove unused import

1f42cd7

add javadoc

9c044eb

DaveTeng0 mentioned this pull request May 22, 2024

RATIS-2095. Extract common logic of ratis-shell to RaftUtils for reuse apache/ratis#1098

Merged

fapifta reviewed May 30, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HDDS-10891. Support Ozone to run ratis group-related command #6716

HDDS-10891. Support Ozone to run ratis group-related command #6716

DaveTeng0 commented May 22, 2024 •

edited

Loading

DaveTeng0 commented May 22, 2024

DaveTeng0 commented May 22, 2024 •

edited

Loading

fapifta left a comment

fapifta May 30, 2024

DaveTeng0 Jun 12, 2024

fapifta Jun 12, 2024

fapifta May 30, 2024

DaveTeng0 Jun 12, 2024

fapifta Jun 12, 2024

HDDS-10891. Support Ozone to run ratis group-related command #6716

Are you sure you want to change the base?

HDDS-10891. Support Ozone to run ratis group-related command #6716

Conversation

DaveTeng0 commented May 22, 2024 • edited Loading

What changes were proposed in this pull request?

What is the link to the Apache JIRA

How was this patch tested?

DaveTeng0 commented May 22, 2024

DaveTeng0 commented May 22, 2024 • edited Loading

fapifta left a comment

Choose a reason for hiding this comment

fapifta May 30, 2024

Choose a reason for hiding this comment

DaveTeng0 Jun 12, 2024

Choose a reason for hiding this comment

fapifta Jun 12, 2024

Choose a reason for hiding this comment

fapifta May 30, 2024

Choose a reason for hiding this comment

DaveTeng0 Jun 12, 2024

Choose a reason for hiding this comment

fapifta Jun 12, 2024

Choose a reason for hiding this comment

DaveTeng0 commented May 22, 2024 •

edited

Loading

DaveTeng0 commented May 22, 2024 •

edited

Loading