Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

External Cluster Environments #1244

Open
wants to merge 32 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
32 commits
Select commit Hold shift + click to select a range
49985f5
add remote cluster functionality
Shrinjay Jan 24, 2023
d491508
Add context selection to helm and passthrough
Shrinjay Jan 25, 2023
686ed39
Remove label
Shrinjay Jan 25, 2023
7e90611
Put back comments
Shrinjay Jan 25, 2023
2daf2cc
Fix names
Shrinjay Jan 25, 2023
ff5aab5
Fix docstrings, generator, and various other
Shrinjay Jan 25, 2023
20a5c56
black fixes
Shrinjay Jan 25, 2023
0d43bec
Add remote autoconfiguration and role
Shrinjay Jan 26, 2023
875fafa
Fix env var checking
Shrinjay Jan 26, 2023
5d1d732
move to namespaced roles for remote clusters
Shrinjay Jan 27, 2023
19f92c7
Formatting and comments
Shrinjay Jan 27, 2023
e0df4e2
Add documentation and cleanup
Shrinjay Jan 27, 2023
5e23382
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 27, 2023
c9b90ce
Fix doc requirements
Shrinjay Jan 30, 2023
0e0ae6f
Merge branch 'feature/remote-cluster' of https://github.com/Shrinjay/…
Shrinjay Jan 30, 2023
9a9416e
Merge branch 'main' into feature/remote-cluster
Shrinjay Jan 30, 2023
9ba9ab7
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 30, 2023
0a4c6fb
Change to external cluster naming
Shrinjay Jan 31, 2023
d8b55b6
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 31, 2023
94ab66e
Chart lint fixes
Shrinjay Jan 31, 2023
3514129
Merge branch 'feature/remote-cluster' of https://github.com/Shrinjay/…
Shrinjay Jan 31, 2023
4fbe8b7
Update docs/source/operators/deploy-kubernetes.md
Shrinjay Apr 14, 2023
51256b9
Update docs/source/operators/deploy-kubernetes.md
Shrinjay Apr 14, 2023
8107dac
Update enterprise_gateway/services/processproxies/crd.py
Shrinjay Apr 14, 2023
26cba1a
Apply suggestions from code review
Shrinjay Apr 14, 2023
55065ca
Update docs/source/operators/deploy-kubernetes.md
Shrinjay Apr 14, 2023
3282e14
Update etc/kubernetes/helm/enterprise-gateway/templates/deployment.yaml
Shrinjay Apr 14, 2023
047538e
progress
Shrinjay Apr 14, 2023
7c06f2b
throw warning
Shrinjay Apr 14, 2023
c909106
merge latest
Shrinjay Apr 14, 2023
a1a9e46
Merge branch 'main' into feature/remote-cluster
Shrinjay Apr 14, 2023
8554343
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 14, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
29 changes: 28 additions & 1 deletion docs/source/operators/deploy-kubernetes.md
Original file line number Diff line number Diff line change
Expand Up @@ -522,7 +522,11 @@ can override them with helm's `--set` or `--values` options. Always use `--set`
| `kip.defaultContainerRegistry` | Prefix to use if a registry is not already specified on image name (e.g., elyra/kernel-py:VERSION) | `docker.io` |
| `kip.fetcher` | fetcher to fetch image names, defaults to KernelSpecsFetcher | `KernelSpecsFetcher` |
| `kip.images` | if StaticListFetcher is used KIP_IMAGES defines the list of images pullers will fetch | `[]` |
| `kip.internalFetcher ` | if CombinedImagesFetcher is used KIP_INTERNAL_FETCHERS defines the fetchers that get used internally | `KernelSpecsFetcher` |
| `kip.internalFetcher ` | if CombinedImagesFetcher is used KIP_INTERNAL_FETCHERS defines the fetchers that get used internally | |
| `externalCluster.enable` | Launch kernels in a remote cluster. Used for multi-cluster environments. **Must place a kubeconfig file in the `config/` folder of the helm chart**. | `false` |
| `externalCluster.configPath` | Path to mount kubeconfig at | `/etc/kube/config` |
| `externalCluster.configFilename` | Filename to kubeconfig file inside `config/` directory of chart | `kubeconfig` |
| `externalCluster.autoConfigureRemote` | Automatically create service account in remote cluster | |

## Uninstalling Enterprise Gateway

Expand Down Expand Up @@ -958,6 +962,29 @@ Of particular importance is the mapping to port `8888` (e.g.,`32422`). If you ar

The value of the `JUPYTER_GATEWAY_URL` used by the gateway-enabled Notebook server will vary depending on whether you choose to define an external IP or not. If and external IP is defined, you'll set `JUPYTER_GATEWAY_URL=<externalIP>:8888` else you'll set `JUPYTER_GATEWAY_URL=<k8s-master>:32422` **but also need to restart clients each time Enterprise Gateway is started.** As a result, use of the `externalIPs:` value is highly recommended.

## Multi-Cluster Environments

### Overview

With `externalCluster.enabled` set to `true`, Enterprise Gateway can be used on multi-cluster environments where the jupyter enterprise gateway pods and kernel pods are launched on separate clusters. To configure this:

1. Ensure your two clusters have interconnceted networks. Pods in the two clusters must be able to communicate with each other over pod IP alone.
1. Provide a kubeconfig file for use in the `config/` subdirectory of `etc/kubernetes/helm/enterprise-gateway` chart.
1. Set `externalCluster.enabled` to `true`.

Enterprise Gateway will now launch kernel pods in whichever cluster you have set to default in your kubeconfig.

### Resources in Remote Clusters

For Enterprise Gateway to work across clusters, Enterprise Gateway must create the following resources in the cluster your kernels will be launched on.

- The kernel resource.
- A service account for the kernel pods (if `externalCluster.autoConfigureRemote` is set to `true`).
- A namespaced role for the namespace where your kernel pods will be launched.
- A role binding between your namespaced role and your service account.

The role resource is defined in the `templates/kernel-role.yaml` template of the helm chart. Permissions can be set there.

## Kubernetes Tips

The following items illustrate some useful commands for navigating Enterprise Gateway within a kubernetes environment.
Expand Down
6 changes: 5 additions & 1 deletion enterprise_gateway/services/processproxies/crd.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,8 @@

from kubernetes import client

from enterprise_gateway.services.processproxies.k8s_client import kubernetes_client

from ..kernels.remotemanager import RemoteKernelManager
from .k8s import KubernetesProcessProxy

Expand Down Expand Up @@ -105,7 +107,9 @@ def delete_managed_object(self, termination_stati: list[str]) -> bool:

Note: the caller is responsible for handling exceptions.
"""
delete_status = client.CustomObjectsApi().delete_namespaced_custom_object(
delete_status = client.CustomObjectsApi(
api_client=kubernetes_client
).delete_namespaced_custom_object(
self.group,
self.version,
self.kernel_namespace,
Expand Down
102 changes: 89 additions & 13 deletions enterprise_gateway/services/processproxies/k8s.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,10 +10,15 @@
from typing import Any

import urllib3
from kubernetes import client, config
import yaml
from kubernetes import client
from kubernetes.utils.create_from_yaml import create_from_yaml_single_item

from enterprise_gateway.services.processproxies.k8s_client import kubernetes_client

from ..kernels.remotemanager import RemoteKernelManager
from ..sessions.kernelsessionmanager import KernelSessionManager
from ..utils.envutils import is_env_true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we refactor this filename to include an underscore for word separation?

Suggested change
from ..utils.envutils import is_env_true
from ..utils.env_utils import is_env_true

from .container import ContainerProcessProxy

urllib3.disable_warnings()
Expand All @@ -29,8 +34,6 @@
share_gateway_namespace = bool(os.environ.get("EG_SHARED_NAMESPACE", "False").lower() == "true")
kpt_dir = os.environ.get("EG_POD_TEMPLATE_DIR", "/tmp") # noqa

config.load_incluster_config()


class KubernetesProcessProxy(ContainerProcessProxy):
"""
Expand All @@ -56,6 +59,12 @@ async def launch_process(
) -> KubernetesProcessProxy:
"""Launches the specified process within a Kubernetes environment."""
# Set env before superclass call, so we can see these in the debug output
use_remote_cluster = os.getenv("EG_USE_REMOTE_CLUSTER")
if use_remote_cluster:
kwargs["env"]["EG_USE_REMOTE_CLUSTER"] = 'true'
kwargs["env"]["EG_REMOTE_CLUSTER_KUBECONFIG_PATH"] = os.getenv(
"EG_REMOTE_CLUSTER_KUBECONFIG_PATH"
)

# Kubernetes relies on internal env variables to determine its configuration. When
# running within a K8s cluster, these start with KUBERNETES_SERVICE, otherwise look
Expand Down Expand Up @@ -85,7 +94,7 @@ def get_container_status(self, iteration: int | None) -> str:
# is used for the assigned_ip.
pod_status = ""
kernel_label_selector = "kernel_id=" + self.kernel_id + ",component=kernel"
ret = client.CoreV1Api().list_namespaced_pod(
ret = client.CoreV1Api(api_client=kubernetes_client).list_namespaced_pod(
namespace=self.kernel_namespace, label_selector=kernel_label_selector
)
if ret and ret.items:
Expand Down Expand Up @@ -121,7 +130,7 @@ def delete_managed_object(self, termination_stati: list[str]) -> bool:
# Deleting a Pod will return a v1.Pod if found and its status will be a PodStatus containing
# a phase string property
# https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.21/#podstatus-v1-core
v1_pod = client.CoreV1Api().delete_namespaced_pod(
v1_pod = client.CoreV1Api(api_client=kubernetes_client).delete_namespaced_pod(
namespace=self.kernel_namespace, body=body, name=self.container_name
)
status = None
Expand Down Expand Up @@ -168,7 +177,7 @@ def terminate_container_resources(self) -> bool | None:
body = client.V1DeleteOptions(
grace_period_seconds=0, propagation_policy="Background"
)
v1_status = client.CoreV1Api().delete_namespace(
v1_status = client.CoreV1Api(api_client=kubernetes_client).delete_namespace(
name=self.kernel_namespace, body=body
)
status = None
Expand Down Expand Up @@ -239,7 +248,7 @@ def _determine_kernel_namespace(self, **kwargs: dict[str, Any] | None) -> str:

# If KERNEL_NAMESPACE was provided, then we assume it already exists. If not provided, then we'll
# create the namespace and record that we'll want to delete it as well.
namespace = kwargs["env"].get("KERNEL_NAMESPACE")
namespace = os.environ.get("KERNEL_NAMESPACE")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prior to launch, KERNEL_NAMESPACE must be pulled from the start request arguments (kwargs["env"]) rather than the EG process env, so this change needs to be reverted.

Suggested change
namespace = os.environ.get("KERNEL_NAMESPACE")
namespace = kwargs["env"].get("KERNEL_NAMESPACE")

if namespace is None:
# check if share gateway namespace is configured...
if share_gateway_namespace: # if so, set to EG namespace
Expand Down Expand Up @@ -283,10 +292,17 @@ def _create_kernel_namespace(self, service_account_name: str) -> str:

# create the namespace
try:
client.CoreV1Api().create_namespace(body=body)
client.CoreV1Api(api_client=kubernetes_client).create_namespace(body=body)
self.delete_kernel_namespace = True
self.log.info(f"Created kernel namespace: {namespace}")

# If remote cluster is being used, service account may not be present, create before role binding
# If creating service account is disabled, operator must manually create svc account
if is_env_true('EG_USE_REMOTE_CLUSTER') and is_env_true('EG_CREATE_REMOTE_SVC_ACCOUNT'):
self._create_service_account_if_not_exists(
namespace=namespace, service_account_name=service_account_name
)

# Now create a RoleBinding for this namespace for the default ServiceAccount. We'll reference
# the ClusterRole, but that will only be applied for this namespace. This prevents the need for
# creating a role each time.
Expand All @@ -310,14 +326,66 @@ def _create_kernel_namespace(self, service_account_name: str) -> str:
body = client.V1DeleteOptions(
grace_period_seconds=0, propagation_policy="Background"
)
client.CoreV1Api().delete_namespace(name=namespace, body=body)
client.CoreV1Api(api_client=kubernetes_client).delete_namespace(
name=namespace, body=body
)
self.log.warning(f"Deleted kernel namespace: {namespace}")
else:
reason = f"Error occurred creating namespace '{namespace}': {err}"
self.log_and_raise(http_status_code=500, reason=reason)

return namespace

def _create_service_account_if_not_exists(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is strictly for external clusters, could we rename this to something like: _create_external_service_account_if_not_exists or _create_remote_service_account_if_not_exists. Since the other configurable options refer to "external", I guess I'd prefer the former.

self, namespace: str, service_account_name: str
) -> None:
"""If service account doesn't exist in target cluster, create one. Occurs if a remote cluster is being used."""
service_account_list_in_namespace: client.V1ServiceAccountList = client.CoreV1Api(
api_client=kubernetes_client
).list_namespaced_service_account(namespace=namespace)

service_accounts_in_namespace: list[
client.V1ServiceAccount
] = service_account_list_in_namespace.items
service_account_names_in_namespace: list[str] = [
svcaccount.metadata.name for svcaccount in service_accounts_in_namespace
]

if service_account_name not in service_account_names_in_namespace:
service_account_metadata = {"name": service_account_name}
service_account_to_create: client.V1ServiceAccount = client.V1ServiceAccount(
kind="ServiceAccount", metadata=service_account_metadata
)

client.CoreV1Api(api_client=kubernetes_client).create_namespaced_service_account(
namespace=namespace, body=service_account_to_create
)

self.log.info(
f"Created service account {service_account_name} in namespace {namespace}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way to access the "name" of the external cluster? Seems really helpful to include that here.

Suggested change
f"Created service account {service_account_name} in namespace {namespace}"
f"Created service account {service_account_name} in namespace {namespace} of external cluster {external_cluster}"

)

def _create_role_if_not_exists(self, namespace: str) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A similar name change to that suggested previously would be nice here.

"""If role doesn't exist in target cluster, create one. Occurs if a remote cluster is being used"""
role_yaml_path = os.getenv('EG_REMOTE_CLUSTER_ROLE_PATH')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should role_yaml_path be validated for None and the file's existence?

This line (and any validation) should be moved within the if kernel_cluster_role not in remote_cluster_role_names: block at L379 so that it's only performed if needed.


# Get Roles in remote cluster
remote_cluster_roles: client.V1RoleList = client.RbacAuthorizationV1Api(
api_client=kubernetes_client
).list_namespaced_role(namespace=namespace)
remote_cluster_role_names = [role.metadata.name for role in remote_cluster_roles.items]

# If the kernel Role does not exist in the remote cluster.
if kernel_cluster_role not in remote_cluster_role_names:
with open(role_yaml_path) as f:
role_yaml = yaml.safe_load(f)
role_yaml["metadata"]["namespace"] = namespace
create_from_yaml_single_item(yml_object=role_yaml, k8s_client=kubernetes_client)

self.log.info(f"Created role {kernel_cluster_role} in namespace {namespace}")
else:
self.log.info(f"Found role {kernel_cluster_role} in namespace {namespace}")

def _create_role_binding(self, namespace: str, service_account_name: str) -> None:
# Creates RoleBinding instance for the given namespace. The role used will be the ClusterRole named by
# EG_KERNEL_CLUSTER_ROLE.
Expand All @@ -330,9 +398,17 @@ def _create_role_binding(self, namespace: str, service_account_name: str) -> Non
role_binding_name = kernel_cluster_role # use same name for binding as cluster role
labels = {"app": "enterprise-gateway", "component": "kernel", "kernel_id": self.kernel_id}
binding_metadata = client.V1ObjectMeta(name=role_binding_name, labels=labels)
binding_role_ref = client.V1RoleRef(
api_group="", kind="ClusterRole", name=kernel_cluster_role
)

# If remote cluster is used, we need to create a role on that cluster
if is_env_true('EG_USE_REMOTE_CLUSTER'):
self._create_role_if_not_exists(namespace=namespace)
# We use namespaced roles on remote clusters rather than a ClusterRole
binding_role_ref = client.V1RoleRef(api_group="", kind="Role", name=kernel_cluster_role)
else:
binding_role_ref = client.V1RoleRef(
api_group="", kind="ClusterRole", name=kernel_cluster_role
)

binding_subjects = client.V1Subject(
api_group="", kind="ServiceAccount", name=service_account_name, namespace=namespace
)
Expand All @@ -344,7 +420,7 @@ def _create_role_binding(self, namespace: str, service_account_name: str) -> Non
subjects=[binding_subjects],
)

client.RbacAuthorizationV1Api().create_namespaced_role_binding(
client.RbacAuthorizationV1Api(api_client=kubernetes_client).create_namespaced_role_binding(
namespace=namespace, body=body
)
self.log.info(
Expand Down
5 changes: 5 additions & 0 deletions enterprise_gateway/services/processproxies/k8s_client.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
"""Instantiates a static global factory and a single atomic client"""
from enterprise_gateway.services.processproxies.k8s_client_factory import KubernetesClientFactory

KUBERNETES_CLIENT_FACTORY = KubernetesClientFactory()
kubernetes_client = KUBERNETES_CLIENT_FACTORY.get_kubernetes_client()
42 changes: 42 additions & 0 deletions enterprise_gateway/services/processproxies/k8s_client_factory.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
"""Contains factory to create kubernetes api client instances using a single confguration"""
import os

from kubernetes import client, config
from traitlets.config import SingletonConfigurable

from enterprise_gateway.services.utils.envutils import is_env_true


class KubernetesClientFactory(SingletonConfigurable):
"""Manages kubernetes client creation from environment variables"""

def get_kubernetes_client(self) -> client.ApiClient:
"""Get kubernetes api client with appropriate configuration
Returns:
ApiClient: Kubernetes API client for appropriate cluster
"""
kubernetes_config: client.Configuration = client.Configuration()
if os.getenv("KUBERNETES_SERVICE_HOST"):
# Running inside cluster
if is_env_true('EG_USE_REMOTE_CLUSTER') and not is_env_true('EG_SHARED_NAMESPACE'):
kubeconfig_path = os.getenv(
'EG_REMOTE_CLUSTER_KUBECONFIG_PATH', '/etc/kube/config/kubeconfig'
)
context = os.getenv('EG_REMOTE_CLUSTER_CONTEXT', None)
config.load_kube_config(
client_configuration=kubernetes_config,
config_file=kubeconfig_path,
context=context,
)
else:
if is_env_true('EG_USE_REMOTE_CLUSTER'):
self.log.warning(
"Cannot use EG_USE_REMOTE_CLUSTER and EG_SHARED_NAMESPACE at the same time. Using local cluster...."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

)

config.load_incluster_config(client_configuration=kubernetes_config)
else:
config.load_kube_config(client_configuration=kubernetes_config)

self.log.debug(f"Created kubernetes client for host {kubernetes_config.host}")
return client.ApiClient(kubernetes_config)
Empty file.
7 changes: 7 additions & 0 deletions enterprise_gateway/services/utils/envutils.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
"""""Utilities to make checking environment variables easier"""
import os


def is_env_true(env_variable_name: str) -> bool:
"""If environment variable is set and value is case-insensitively "true", then return true. Else return false"""
return bool(os.getenv(env_variable_name, "False").lower() == "true")
14 changes: 14 additions & 0 deletions etc/docker/enterprise-gateway/Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,20 @@ RUN apt update && apt install -yq curl openjdk-8-jdk
ENV JAVA_HOME /usr/lib/jvm/java
RUN ln -s $(readlink -f /usr/bin/javac | sed "s:/bin/javac::") ${JAVA_HOME}

RUN curl https://apt.releases.teleport.dev/gpg \
-o /usr/share/keyrings/teleport-archive-keyring.asc

RUN source /etc/os-release
RUN echo "deb [signed-by=/usr/share/keyrings/teleport-archive-keyring.asc] \
https://apt.releases.teleport.dev/ubuntu jammy stable/v11" \
| tee /etc/apt/sources.list.d/teleport.list > /dev/null

RUN apt-get update && apt-get install teleport

# Download and install kubectl
RUN curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
RUN sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl

# Download and install Spark
RUN curl -s https://archive.apache.org/dist/spark/spark-${SPARK_VER}/spark-${SPARK_VER}-bin-hadoop2.7.tgz | \
tar -xz -C /opt && \
Expand Down
Loading