
Added support for ingress over routes on cluster creation #251

Merged: 6 commits merged into project-codeflare:main on Oct 29, 2023

Conversation

@Bobbins228 (Contributor) commented Jul 26, 2023

Issue link

Closes #219

What changes have been made

Added an ingress_options dict for specifying the options you want to set for your dashboard ingress.
An example of the specified ingress_options would look like this (based off the routes for the dashboard):

cluster = Cluster(ClusterConfiguration(
    local_interactive=local_interactive,
    namespace=namespace,
    name=cluster_name,
    num_workers=1,
    min_cpus=1,
    max_cpus=1,
    min_memory=4,
    max_memory=4,
    num_gpus=0,
    instascale=False,
    machine_types=["m5.xlarge", "p3.8xlarge"],
    ingress_options={
        "ingresses": [
            {
                "ingressName": "ray-dashboard-test",
                "port": 8265,
                "pathType": "Prefix",
                "path": "/",
                "host": "<your_host>",
            },
        ]
    },
))

Without specifying any ingress options, the default dashboard and client ingresses will be created, provided you have set the ingress_domain (note: this is only a requirement on Kubernetes).
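For illustration, a minimal sketch of that default path (hypothetical values; the domain is a placeholder you would replace with your own):

cluster = Cluster(ClusterConfiguration(
    name="test",
    namespace="default",
    num_workers=1,
    # no ingress_options: the SDK creates the default dashboard and
    # client ingresses itself
    ingress_domain="apps.example.com",  # placeholder; only required on Kubernetes
))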

Important steps for Kind clusters

Add the --enable-ssl-passthrough argument to the NGINX ingress controller.
Add the ingress_domain field in your cluster configuration, set to your local domain, if you are not using ingress_options.

Verification steps

Specify your ingress options in ClusterConfiguration.
Run cluster.up() after authenticating in a Jupyter notebook.
Run kubectl get ingress.
You should see the ingresses you specified.
Upon running cluster.details() you should be able to follow the link to the Ray dashboard, provided you created an ingress for it or omitted ingress_options.
Running cluster.down() will delete the ingresses. (A sketch combining these steps follows.)
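As a hypothetical end-to-end sketch of those steps (assuming the cluster object from the configuration above and an already-authenticated session):

cluster.up()        # creates the RayCluster and the specified ingresses
cluster.details()   # prints cluster status and the Ray dashboard link
# in a shell: kubectl get ingress   (should list the ingresses you specified)
cluster.down()      # deletes the cluster and its ingresses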

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

@tedhtchang (Member) left a comment

@Bobbins228 The PR seems to replace Route with Ingress completely. I think we still want to support Route for OpenShift unless OpenShift supports Ingress. I have not tried this.

@Bobbins228 (Contributor, Author)

@tedhtchang OpenShift supports Ingress, so we don't really need to use Routes anymore.

Tested this out myself and the ingress works on OpenShift

@MichaelClifford (Collaborator)

LGTM!

@Maxusmusti (Collaborator)

Also needs a rebase while addressing Ted's comments

@tedhtchang (Member) left a comment

The local interactive did not work for me.
A couple of reasons I can see:

  1. I think the ray client route/ingress needs to be something like rayclient-hfgputest-1-default.apps.ted413med.cp.fyre.ibm.com (rayclient, without a dash) because it needs to match the Subject Alternative Name in the server.crt in the ray pods. For example:
# kubectl exec -it hfgputest-1-head-9hlgk -- bash
openssl x509 -text -noout -in /home/ray/workspace/tls/server.crt
...
            X509v3 Subject Alternative Name:
                DNS:127.0.0.1, DNS:localhost, DNS:hfgputest-1-head-svc.default.svc.cluster.local, DNS:10.254.16.36, DNS:rayclient-hfgputest-1-default.apps.ted413med.cp.fyre.ibm.com
  2. The ray client route termination needs to be passthrough. For example:
NAME                        HOST/PORT                                                          PATH   SERVICES               PORT        TERMINATION   WILDCARD
ray-dashboard-hfgputest-1   ray-dashboard-hfgputest-1-default.apps.ted413med.cp.fyre.ibm.com          hfgputest-1-head-svc   dashboard                 None
rayclient-hfgputest-1       rayclient-hfgputest-1-default.apps.ted413med.cp.fyre.ibm.com              hfgputest-1-head-svc   client      passthrough   None

This ingress should give you some clue how to enable passthrough for gRPC. It could take a different form when translating this ingress to a route.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 22, 2023
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 25, 2023
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 31, 2023
@tedhtchang (Member) left a comment

Hey @Bobbins228, I verified this worked on my OCP and OpenShift Local, but like you said, this API

    try:
        api_client = client.CustomObjectsApi(api_config_handler())
        ingress = api_client.get_cluster_custom_object(
            "config.openshift.io", "v1", "ingresses", "cluster"
        )
    except Exception as e:  # pragma: no cover
        return _kube_api_error_handling(e)
    domain = ingress["spec"]["domain"]

is not supported on KinD (k8s). We will need to find a way to return a domain name.

@Maxusmusti (Collaborator)

Can this config.openshift.io be replaced with something that is compatible with both K8s ingress and OCP ingress? Or do we need a more complex solution here?

@Maxusmusti (Collaborator)

Perhaps instead of using a custom_object, using the Kubernetes API's actual Ingress API objects may do the work we need internally? A sketch of that idea follows.
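As a hypothetical sketch of that direction, using the Kubernetes Python client's native networking.k8s.io/v1 API instead of CustomObjectsApi (the namespace and field access are illustrative):

from kubernetes import client, config

config.load_kube_config()
networking = client.NetworkingV1Api()
# Native Ingress objects are served on both vanilla K8s and OCP.
for ing in networking.list_namespaced_ingress(namespace="default").items:
    host = ing.spec.rules[0].host if ing.spec.rules else None
    print(ing.metadata.name, host)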

@tedhtchang (Member) commented Sep 5, 2023

I might have mistaken the OpenShift CRD ingresses.config.openshift.io for the k8s native ingress.
On OCP, by default, there is a cluster-scoped ingresses.config CR object from which we extract the cluster domain. For example:
oc get ingresses.config cluster -oyaml

apiVersion: config.openshift.io/v1
kind: Ingress
metadata:
  name: cluster
spec:
  domain: apps-crc.testing
  loadBalancer:
    platform:
      type: ""

Then we can use the domain to assemble the host for the Ingress object in our base-template. For example:

        apiVersion: networking.k8s.io/v1
        kind: Ingress
        metadata:
          name: rayclient-deployment-name
          namespace: default
          annotations:
            route.openshift.io/termination: passthrough
            nginx.ingress.kubernetes.io/ssl-passthrough: "true"
            nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
          labels:
            odh-ray-cluster-service: deployment-name-head-svc
        spec:
          ingressClassName: openshift-default
          rules:
            - host: rayclient-raytest.apps-crc.testing
              http:
                paths:
                - path: ''
                  pathType: ImplementationSpecific
                  backend:
                    service:
                      name: deployment-name-head-svc
                      port:
                        number: 10001

For KinD or vanilla K8s we don't have the ingresses.config CRD. We may need a different code path: if KinD, get the cluster domain from the hostname of the workstation where KinD is installed. For local development, this should work; a sketch follows.
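A hypothetical sketch of that local-development fallback (the hostname and cluster name are illustrative):

import socket

# On the workstation running KinD, use the machine's hostname as the
# ingress domain and assemble the ray client host from it.
ingress_domain = socket.gethostname()  # e.g. "tedy1.fyre.ibm.com"
host = f"rayclient-raytest.{ingress_domain}"
print(host)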

@tedhtchang (Member)

Not related to returning the cluster domain, but FYI, a tutorial for setting up local ingress domains on KinD for development: https://mjpitz.com/blog/2020/10/21/local-ingress-domains-kind/

@tedhtchang (Member) commented Sep 13, 2023

I have not found a way for the KinD cluster to return the ingress domain.
I think we could:

  1. If the ingresses.config.openshift.io GVK exists, call this API to get the domain.
  2. Else, get the domain from a new ClusterConfiguration param, i.e. ingress_domain. For a local KinD environment, this could be the hostname, such as on a VM: ClusterConfiguration(ingress_domain=tedy1.fyre.ibm.com). For a cloud provider such as IKS, it's usually something like xxxx.000000001111111222233334444-555.us-south.containers.appdomain.cloud, which users can easily find on the provider's web console. A sketch of this fallback follows.
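A minimal sketch of that two-step fallback (hypothetical helper; error handling is simplified, and in-cluster config loading would differ):

from kubernetes import client, config
from kubernetes.client.rest import ApiException

def get_ingress_domain(ingress_domain=None):
    config.load_kube_config()
    api = client.CustomObjectsApi()
    try:
        # Step 1: on OpenShift, read the domain from the
        # ingresses.config.openshift.io "cluster" object.
        cr = api.get_cluster_custom_object(
            "config.openshift.io", "v1", "ingresses", "cluster"
        )
        return cr["spec"]["domain"]
    except ApiException as e:
        # Step 2: on KinD/K8s the GVK does not exist (404), so fall
        # back to the user-supplied ClusterConfiguration(ingress_domain=...).
        if e.status == 404 and ingress_domain:
            return ingress_domain
        raise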

@blublinsky

The KubeRay operator already provides support for creating and managing the Ingress (and Route, in the main branch). Why do we create this externally to the operator? Would it be easier to just add this to the KubeRay YAML?

@tedhtchang (Member) commented Sep 14, 2023

@blublinsky This PR also converts the ray client route into an ingress. The SDK optionally creates a ray client route for advanced users to interact with the ray cluster directly from their own laptop. The ray client is gRPC (port 10001) and requires TLS and passthrough to be enabled on a route.

However, the dashboard route support for OCP was created a while ago. It's possible to convert to the KubeRay native feature you implemented recently, if the resulting dashboard route object meets the requirements. We should create another thread to discuss with @Maxusmusti and @Bobbins228 to understand the requirements.

@Maxusmusti (Collaborator)

Be sure to also include @KPostOffice in any future threads. He also left some comments on the last Slack thread that may be worth a read.

@KPostOffice (Collaborator)

https://github.com/project-codeflare/codeflare-sdk/pull/251/files#diff-279fd895cfcb8295f1b610826f8eb7a40fd543ff873661acd9dc452cef3d9d14L416

Here, when enabling the OAuth sidecar, index 1 of the generic items is deleted.

@tedhtchang (Member) left a comment

/LGTM thanks @Bobbins228

@openshift-ci openshift-ci bot added lgtm Indicates that a PR is ready to be merged. and removed lgtm Indicates that a PR is ready to be merged. labels Oct 24, 2023
@tedhtchang (Member) left a comment

/LGTM I think this is just a rebase.
Update: Thanks, you actually removed more unused code. I will test it out again.
Update 2: It works for me on KinD. Maybe someone can try it on OpenShift.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 25, 2023
@tedhtchang
Copy link
Member

/LGTM cancel
@Bobbins228
I got an error when connecting to the rayclient URL on OCP. Could we have another person (@KPostOffice, @Maxusmusti) verify this PR on their OCP cluster? (instructions)

2023-10-26 17:30:20,045 DEBUG worker.py:813 -- Pinging server.
2023-10-26 17:30:20,047 DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.CONNECTING
2023-10-26 17:30:23,132 DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.TRANSIENT_FAILURE
2023-10-26 17:30:23,133 WARNING dataclient.py:403 -- Encountered connection issues in the data channel. Attempting to reconnect.
2023-10-26 17:30:23,338 DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.IDLE
2023-10-26 17:30:23,545 DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.CONNECTING
2023-10-26 17:30:23,654 DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.TRANSIENT_FAILURE
2023-10-26 17:30:28,340 DEBUG worker.py:226 -- Couldn't connect channel in 5 seconds, retrying
2023-10-26 17:30:28,340 DEBUG worker.py:237 -- Waiting for Ray to become ready on the server, retry in 5s...
2023-10-26 17:30:33,346 DEBUG worker.py:226 -- Couldn't connect channel in 5 seconds, retrying
2023-10-26 17:30:33,348 DEBUG worker.py:237 -- Waiting for Ray to become ready on the server, retry in 5s...
2023-10-26 17:30:38,353 DEBUG worker.py:226 -- Couldn't connect channel in 5 seconds, retrying
2023-10-26 17:30:38,354 DEBUG worker.py:237 -- Waiting for Ray to become ready on the server, retry in 5s...
2023-10-26 17:30:40,380 DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.READY
2023-10-26 17:30:40,381 DEBUG worker.py:813 -- Pinging server.
2023-10-26 17:30:40,395 DEBUG dataclient.py:412 -- Reconnection succeeded!
2023-10-26 17:30:40,406 ERROR dataclient.py:330 -- Unrecoverable error in data channel.
2023-10-26 17:30:40,407 DEBUG dataclient.py:331 -- <_MultiThreadedRendezvous of RPC that terminated with:
  status = StatusCode.NOT_FOUND
  details = "Attempted to reconnect a session that has already been cleaned up"
  debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"Attempted to reconnect a session that has already been cleaned up", grpc_status:5, created_time:"2023-10-26T17:30:40.40596-07:00"}"
>
2023-10-26 17:30:40,407 DEBUG dataclient.py:285 -- Shutting down data channel.

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Oct 27, 2023
@tedhtchang (Member) left a comment

I was not able to verify the local interactive client on OCP, but @KPostOffice was able to successfully verify the PR. /LGTM
Update: I verified on the OCP cluster from my Linux VM.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 27, 2023
openshift-ci bot (Contributor) commented Oct 29, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: KPostOffice, tedhtchang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 29, 2023
@openshift-ci openshift-ci bot merged commit 3aa1919 into project-codeflare:main Oct 29, 2023
3 checks passed