Current namespace not used when creating daskcluster in k8s #921

Open
john-jam opened this issue Dec 4, 2024 · 9 comments
Labels
needs info Needs further information from the user

Comments

@john-jam
Contributor

john-jam commented Dec 4, 2024

Describe the issue:

When using a Dask operator deployment in k8s with the role/rolebinding defined at the namespace level (rbac.cluster: false), creating a daskclusters.kubernetes.dask.org resource with a service account (dask in the example) from a pod inside a namespace (myns in the example) leads to the following error:

Short Error Message:

User "system:serviceaccount:myns:dask" cannot create resource "daskclusters" in API group "kubernetes.dask.org" in the namespace "default"
...
User "system:serviceaccount:myns:dask" cannot list resource "daskclusters" in API group "kubernetes.dask.org" in the namespace "default"

Full Stacktrace:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/kr8s/_api.py", line 168, in call_api
    response.raise_for_status()
  File "/usr/local/lib/python3.11/site-packages/httpx/_models.py", line 829, in raise_for_status
    raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Client error '403 Forbidden' for url 'https://.../apis/kubernetes.dask.org/v1/namespaces/default/daskclusters'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/403

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/prefect/engine.py", line 42, in <module>
    run_flow(flow, flow_run=flow_run)
  File "/usr/local/lib/python3.11/site-packages/prefect/flow_engine.py", line 1453, in run_flow
    return run_flow_sync(**kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/flow_engine.py", line 1333, in run_flow_sync
    return engine.state if return_type == "state" else engine.result()
                                                       ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/flow_engine.py", line 313, in result
    raise self._raised
  File "/usr/local/lib/python3.11/site-packages/prefect/flow_engine.py", line 721, in run_context
    yield self
  File "/usr/local/lib/python3.11/site-packages/prefect/flow_engine.py", line 1331, in run_flow_sync
    engine.call_flow_fn()
  File "/usr/local/lib/python3.11/site-packages/prefect/flow_engine.py", line 744, in call_flow_fn
    result = call_with_parameters(self.flow.fn, self.parameters)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/utilities/callables.py", line 206, in call_with_parameters
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/workdir/examples/cs/flows/misc/run_on_dask/flow.py", line 43, in run_on_dask
    cluster = KubeCluster(
              ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/dask_kubernetes/operator/kubecluster/kubecluster.py", line 282, in __init__
    self.sync(self._start)
  File "/usr/local/lib/python3.11/site-packages/distributed/utils.py", line 363, in sync
    return sync(
           ^^^^^
  File "/usr/local/lib/python3.11/site-packages/distributed/utils.py", line 439, in sync
    raise error
  File "/usr/local/lib/python3.11/site-packages/distributed/utils.py", line 413, in f
    result = yield future
             ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/tornado/gen.py", line 766, in run
    value = future.result()
            ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/dask_kubernetes/operator/kubecluster/kubecluster.py", line 322, in _start
    await self._create_cluster()
  File "/usr/local/lib/python3.11/site-packages/dask_kubernetes/operator/kubecluster/kubecluster.py", line 361, in _create_cluster
    await cluster.create()
  File "/usr/local/lib/python3.11/site-packages/kr8s/_objects.py", line 320, in create
    async with self.api.call_api(
  File "/usr/local/lib/python3.11/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kr8s/_api.py", line 186, in call_api
    raise ServerError(
kr8s._exceptions.ServerError: daskclusters.kubernetes.dask.org is forbidden: User "system:serviceaccount:myns:dask" cannot create resource "daskclusters" in API group "kubernetes.dask.org" in the namespace "default"
Exception ignored in atexit callback: <function reap_clusters at 0x7afb85dafe20>
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/dask_kubernetes/operator/kubecluster/kubecluster.py", line 1033, in reap_clusters
    asyncio.run(_reap_clusters())
  File "/usr/local/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 654, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/dask_kubernetes/operator/kubecluster/kubecluster.py", line 1031, in _reap_clusters
    cluster.close(timeout=10)
  File "/usr/local/lib/python3.11/site-packages/dask_kubernetes/operator/kubecluster/kubecluster.py", line 700, in close
    return self.sync(self._close, timeout=timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/distributed/utils.py", line 363, in sync
    return sync(
           ^^^^^
  File "/usr/local/lib/python3.11/site-packages/distributed/utils.py", line 439, in sync
    raise error
  File "/usr/local/lib/python3.11/site-packages/distributed/utils.py", line 413, in f
    result = yield future
             ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/tornado/gen.py", line 766, in run
    value = future.result()
            ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/dask_kubernetes/operator/kubecluster/kubecluster.py", line 706, in _close
    cluster = await DaskCluster.get(self.name, namespace=self.namespace)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kr8s/_objects.py", line 265, in get
    raise e
  File "/usr/local/lib/python3.11/site-packages/kr8s/_objects.py", line 255, in get
    resources = await api.async_get(
                ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kr8s/_api.py", line 460, in async_get
    async with self.async_get_kind(
  File "/usr/local/lib/python3.11/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kr8s/_api.py", line 396, in async_get_kind
    async with self.call_api(
  File "/usr/local/lib/python3.11/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kr8s/_api.py", line 186, in call_api
    raise ServerError(
kr8s._exceptions.ServerError: daskclusters.kubernetes.dask.org "test-cluster" is forbidden: User "system:serviceaccount:myns:dask" cannot list resource "daskclusters" in API group "kubernetes.dask.org" in the namespace "default"
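
The mismatch can be read directly from the error text: the service account username encodes the namespace the Pod runs in (myns), while the request targeted default. A minimal sketch that extracts both values from a "forbidden" message of this shape; the helper name `namespace_mismatch` and the regular expression are illustrative assumptions, not part of dask-kubernetes or kr8s:

```python
import re

# Hypothetical pattern for a Kubernetes "forbidden" message of the shape shown above.
FORBIDDEN_RE = re.compile(
    r'User "system:serviceaccount:(?P<sa_namespace>[^:"]+):(?P<sa_name>[^:"]+)"'
    r'.* in the namespace "(?P<target_namespace>[^"]+)"'
)


def namespace_mismatch(message: str):
    """Return (service_account_namespace, target_namespace), or None if no match."""
    match = FORBIDDEN_RE.search(message)
    if match is None:
        return None
    return match.group("sa_namespace"), match.group("target_namespace")


message = (
    "daskclusters.kubernetes.dask.org is forbidden: User "
    '"system:serviceaccount:myns:dask" cannot create resource "daskclusters" '
    'in API group "kubernetes.dask.org" in the namespace "default"'
)
print(namespace_mismatch(message))  # ('myns', 'default')
```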

Minimal Complete Verifiable Example:

Running this inside a pod:

from dask_kubernetes.operator.kubecluster.kubecluster import KubeCluster, make_cluster_spec

if __name__ == '__main__':
    spec = make_cluster_spec(
        name="test-cluster",
    )
    cluster = KubeCluster(
        custom_cluster_spec=spec,
    )

    cluster.adapt(minimum=0, maximum=2)

Anything else we need to know?:

The exact same test works fine with version 2024.5.0, so I think this is due to a change made in the 2024.8.0 release, which is the first version where it fails.

To make this work with 2024.8.0 or later, I need to pass the namespace option when instantiating the KubeCluster, but in my use case I don't know the namespace in advance.

Environment:

  • Dask version: 2024.11.2
  • Python version: 3.11.9
  • Operating System: ubuntu 22.04
  • Install method (conda, pip, source): pip
@jacobtomlinson
Member

It looks like you're trying to create the KubeCluster in a different namespace to the one you installed the operator in. What happens if you set the namespace explicitly?

spec = make_cluster_spec(
    name="test-cluster",
    namespace="myns",
)

jacobtomlinson added the "needs info" (Needs further information from the user) label Dec 5, 2024
@john-jam
Contributor Author

john-jam commented Dec 6, 2024

@jacobtomlinson The operator is actually in the myns namespace, but yes, as mentioned in my comment, specifying the namespace this way fixes the issue.

My use case is that I don't know the namespace name in advance, but I would like to create a KubeCluster in the same namespace as the operator. Before 2024.8.0, KubeCluster was able to infer the operator's namespace and use it automatically. Since 2024.8.0, it uses the default namespace unless you provide one explicitly.

spec = make_cluster_spec(
    name="test-cluster",
)
cluster = KubeCluster(
    custom_cluster_spec=spec,
)

# Will use `myns` in 2024.5.0
# Will use `default` in 2024.8.0

@jacobtomlinson
Member

If you do kubectl get pods it will list Pods in the default namespace unless you specify otherwise. It looks like something changed in 2024.8.0 to mimic that behaviour.

But I see that the expectation here is that if the operator is installed in a single namespace then all DaskCluster resources should default to that namespace if no other default is set in your config?

You said you are running your code inside a Pod. Which namespace is that Pod running in?

@john-jam
Contributor Author

All the pods (the pod from where I run this code to contact the operator + the operator itself) are in the same myns namespace.

But good point regarding the update to comply with kubectl behavior 🤔 I was just curious why this worked in prior versions, but it seems it was a bug acting as a feature, left over from earlier non-standard behavior. Is that correct? If so, I'll have to use the k8s API to retrieve the namespace name at runtime.

In any case thanks for your support!

@jacobtomlinson
Member

It's likely this was just some unintended behaviour that changed at some point, see Hyrum's Law. That being said I don't think it's unreasonable to dig into this further as I am a little surprised by this behaviour.

My next question would be how are you authenticating with the Kubernetes API from within your Pod? Are you using a Service Account, or are you storing credentials in ~/.kube/config?

@jacobtomlinson
Member

Ok I just confirmed that kubectl does not behave this way. It will try to use whatever namespace the Pod is running in as the default.

Reproducer

Create a new namespace
$ kubectl create namespace foo
namespace "foo" created

Run an interactive Pod in that namespace
$ kubectl run --image ubuntu --namespace foo --rm -it -- bash

Install kubectl
# apt update && apt install curl -y && curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" && chmod +x kubectl && mv kubectl /usr/local/bin/
...

Try to list Pods
# kubectl get pods
Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:foo:default" cannot list resource "pods" in API group "" in the namespace "foo"

We can see that when we try to list the Pods it gives us a Permission error, but it's defaulting to the foo namespace. So when there is absolutely no configuration I expect it's looking up the current namespace in /var/run/secrets/kubernetes.io/serviceaccount/.
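
As a stopgap for the use case described earlier in the thread, the Pod's namespace can be read from that mounted directory at runtime. A minimal sketch, assuming the standard service account mount and its `namespace` file; the helper name `current_namespace` and the `fallback` parameter are hypothetical, not part of kr8s or dask-kubernetes:

```python
from pathlib import Path

# Standard location of the namespace file mounted with service account credentials.
SA_NAMESPACE_FILE = Path("/var/run/secrets/kubernetes.io/serviceaccount/namespace")


def current_namespace(path: Path = SA_NAMESPACE_FILE, fallback: str = "default") -> str:
    """Return the namespace the Pod runs in, or `fallback` if it cannot be read."""
    try:
        namespace = path.read_text().strip()
    except OSError:
        # The file is absent outside a Pod, or when the token is not mounted.
        return fallback
    # Treat an empty file the same as a missing one.
    return namespace or fallback
```

The result could then be passed explicitly, for example `make_cluster_spec(name="test-cluster", namespace=current_namespace())`, which matches the explicit-namespace workaround shown earlier without hard-coding the name.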

@jacobtomlinson
Member

It looks like this is a bug in kr8s as it does not behave the same way. I'll raise a bug over there.

Reproducer

Create a new namespace
$ kubectl create namespace foo
namespace "foo" created

Run an interactive Pod in that namespace
$ kubectl run python --image python --namespace foo --rm -it -- bash

Install kr8s
# pip install kr8s
...

Try to list Pods
# python -c 'import kr8s; kr8s.get("pods")'
...
kr8s._exceptions.ServerError: pods is forbidden: User "system:serviceaccount:foo:default" cannot list resource "pods" in API group "" in the namespace "default"

@jacobtomlinson
Member

xref kr8s-org/kr8s#532

@john-jam
Contributor Author

john-jam commented Dec 11, 2024

@jacobtomlinson Thanks for the investigation! I'll follow your bug report on kr8s then 🙇‍♂️

> My next question would be how are you authenticating with the Kubernetes API from within your Pod? Are you using a Service Account, or are you storing credentials in ~/.kube/config?

I used a GKE cluster for my tests with a service account that has the proper permissions within the myns namespace. I reproduced your tests within this GKE cluster and confirm that kubectl and kr8s behave differently there as well.
