Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate to v1.6.0-rc.1 #378

Merged
merged 27 commits into from
Aug 16, 2022

Conversation

gkcalat
Copy link
Contributor

@gkcalat gkcalat commented Aug 4, 2022

I apologize for the large PR. Many items needed to be changed in order to test it on GKE 1.22 and 1.21.

  • Tested on GKE 1.21 + Manifests v1.6.0-rc.1 + KFP 2.0.0-alpha.3
  • Tested on GKE 1.22 + Manifests v1.6.0-rc.1 + KFP 2.0.0-alpha.3

@gkcalat gkcalat changed the title Migrate Migrate to v1.6.0-rc.1 Aug 5, 2022
kubeflow/common/iap-ingress/base/ingress.yaml Show resolved Hide resolved
kubeflow/common/iap-ingress/base/config-map.yaml Outdated Show resolved Hide resolved
kubeflow/common/iap-ingress/Makefile Show resolved Hide resolved
@@ -0,0 +1,64 @@
swagger: "2.0"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: Consider adding a comment or README to explain where is this file coming from.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a generic template. It came from the legacy cloud-endpoints-controller, which is no longer maintained. I am afraid that additional reference might confuse people. Let me know what you think.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is true that it might be risky to refer to non-maintained repository. However, we are still using such legacy template within our solution. We should still document where this template is coming from, so other maintainer knows that where to find it and the current state is repository not-maintained. In situations like this, more documentation is better than less documentation.

Would you like to take the following actions?

  1. Create a README file within kubeflow/common/iap-ingress.
  2. Explain where swagger_template.yaml and setup_cloudendpoints.sh are originating from
  3. Explain that the cloud-endpoints-controller repository is no longer maintained.
  4. We can also list potential improvement where we can move away from any legacy logic. I found a few materials below. Feel free to decide whether to include them:

openapi.yaml sample: https://github.com/GoogleCloudPlatform/endpoints-samples/blob/master/k8s/openapi.yaml
Single app example using IAP, GKE, Cloud endpoint: https://github.com/salrashid123/iap_endpoints_app
IAP ingress controller (might be deprecated as well): https://github.com/danisla/iapingress-controller/tree/master

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Please, review.

kubeflow/common/iap-ingress/base/config-map.yaml Outdated Show resolved Hide resolved
Comment on lines +9 to +19
## iap-enabler

[IAP uses](https://cloud.google.com/iap/docs/signed-headers-howto) JSON Web Tokens ([JWT](https://jwt.io/introduction)) to make sure that a request to kubeflow is authorized. This protects kubeflow from IAP being accidentally disabled, misconfigured firewalls, and access from within the project. This *Deployment* applies a RequestAuthentication (**ingress-jwt**) to the kubeflow cluster based on the [policy.yaml template](./base/policy.yaml).

## backend-updater

HTTPS Load Balancing requires a [health check](https://cloud.google.com/load-balancing/docs/health-check-concepts) to determine if backend (**istio-ingressgateway**) responds to traffic. This *StatefulSet* updates the **iap-backendconfig** with the appropriate backend port and backend path for the corresponding health check.

## cloud-endpoints-enabler

This *Deployment* is introduced to replace cloud-endpoints-controller. It [establishes a cloud endpoint](https://cloud.google.com/endpoints/docs/openapi/get-started-kubernetes-engine-espv2) using OpenAPI specification. It uses [swagger_template.yaml](./base/swagger_template.yaml) to build an appropriate OpenAPI spec. This template was used in the original [cloud-endpoint-controller](https://github.com/danisla/cloud-endpoints-controller) (deprecated) in Kubeflow 1.5.1 and earlier.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good writeup! As an additional information, you can share the link to kubeflow/common/iap-ingress/base/config-map.yaml where people can view and update iap-enabler/backend-updater/cloud-endpoints-enabler.

@zijianjoy
Copy link
Collaborator

/lgtm
/approve

It is an awesome implementation in this PR! Great work Ablai!

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gkcalat, zijianjoy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 3cdefe5 into GoogleCloudPlatform:master Aug 16, 2022
gcloud endpoints services add-iam-policy-binding ${ENDPOINT_NAME} \
--member serviceAccount:${SERVICE_ACCOUNTNAME} \
--role roles/servicemanagement.serviceController
gcloud projects add-iam-policy-binding ${PROJECT} \
Copy link
Contributor

@fabito fabito Aug 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need roles/cloudtrace.agent ?
If this is a new role for the admin SA at the project level, I think we need to move it to kf-admin-policy.yaml:

apiVersion: iam.cnrm.cloud.google.com/v1beta1
kind: IAMPolicyMember
metadata:
  name: KUBEFLOW-NAME-admin-cloudtraceagent # kpt-set: ${name}-admin-cloudtraceagent
spec:
  member: serviceAccount:[email protected] # kpt-set: serviceAccount:${name}-admin@${gcloud.core.project}.iam.gserviceaccount.com
  role: roles/cloudtrace.agent
  resourceRef:
    apiVersion: resourcemanager.cnrm.cloud.google.com/v1beta1
    kind: Project
    external: projects/PROJECT # kpt-set: projects/${gcloud.core.project}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to enable Cloud Trace for troubleshooting, which we might actually disable for now as it doesn't seem to be a necessary feature. Your suggestion on moving it to the YAML file SGTM though. Thank you for your feedback!

sed "s|JWT_AUDIENCE|${JWT_AUDIENCE}|;s|ENDPOINT_NAME|${ENDPOINT_NAME}|;s|INGRESS_TARGET_IP|${INGRESS_TARGET_IP}|" /var/envoy-config/swagger_template.yaml > openapi.yaml

# Deploy and enable the endpoint
gcloud endpoints services deploy openapi.yaml
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my setup, the admin SA does not have enough permissions to execute this.
I had to grant it roles/serviceusage.serviceUsageAdmin (see: https://cloud.google.com/service-usage/docs/access-control#predefined_roles)

Perhaps we need a new entry in kf-admin-policy.yaml ?

Copy link
Contributor

@fabito fabito Aug 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After solving the permission issue above, I have a new endpoint deployed every 30 secs

Screenshot from 2022-08-17 12-59-42

Is this expected ?
Shouldn't we add a check and only deploy if necessary ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @fabito.
Thank you for your feedback!

The purpose of cloud-endpoints-enabler is to create and activate a cloud endpoint during deployment. It's supposed to be deleted at the end of make apply run. The behavior you observed is not intended, as we recommend running make apply and choose necessary components in config.yaml instead of deploying each component separately.

As per permissions, I was not able to reproduce the error you mentioned. Could you create a separate issue with details about your GKE cluster?

Current approach clearly has room for improvement. Your contributions are very welcome!

gcloud config list
gcloud auth list

set_endpoint () {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we exit early if it is already setup ?

Suggested change
set_endpoint () {
set_endpoint () {
gcloud endpoints services describe ${ENDPOINT_NAME}
if [ $? == 0 ] ; then
echo "${ENDPOINT_NAME} cloud endpoint already setup"
return 0
fi

What about scenarios where we want to update it ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of cloud-endpoints-enabler is to create and activate a cloud endpoint during deployment. It's supposed to be deleted at the end of make apply run. We do not support update scenario for now. @zijianjoy mentioned above an idea about having boolean flags to decide when to delete cloud-endpoints-enabler, iap-enabler, and backend-updater workloads.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, got it.

We don't use the makefiles in our setup. We keep the manifests/kustomizations in git and use Fluxcd to update our fleet.
I need to think about how we can better handle the *enabler workloads...

Thanks for the clarification.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants