[BUG] Failed to run rayjob in the sandbox #4026

Closed
2 tasks done
pingsutw opened this issue Sep 12, 2023 · 6 comments
Labels
bug Something isn't working Epic: Ray Ray/KubeRay Support in Flyte good first issue Good for newcomers plugins Plugins related labels (backend or frontend)

@pingsutw
Member

### Describe the bug

I can't run the Ray task in the sandbox, but it works if I run the same task in the EKS cluster.

```python
import typing

from flytekit import ImageSpec, Resources, task, workflow

custom_image = ImageSpec(
    registry="pingsutw",
    packages=["flytekitplugins-ray"],
)

if custom_image.is_container():
    import ray
    from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig


@ray.remote
def f1(x):
    return x * x


@ray.remote
def f2(x):
    return x % 2


ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(ray_start_params={"log-color": "True"}),
    worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=1)],
    runtime_env={"pip": ["numpy", "pandas"]},  # or runtime_env="./requirements.txt"
)


@task(
    cache=True,
    cache_version="0.2",
    task_config=ray_config,
    requests=Resources(mem="1Gi", cpu="1"),
    container_image=custom_image,
)
def ray_task(n: int) -> int:
    futures = [f2.remote(f1.remote(i)) for i in range(n)]
    return sum(ray.get(futures))


@workflow
def ray_workflow(n: int) -> int:
    return ray_task(n=n)


if __name__ == '__main__':
    ray_workflow(n=10)
```

### Expected behavior

Should be able to run a ray task in the sandbox

### Additional context to reproduce

_No response_

### Screenshots

```bash
(artifact) ➜  flyteidl git:(trigger) kgp -n flytesnacks-development
NAME                                                      READY   STATUS      RESTARTS   AGE
f0383f80a4dd84a7c84c-n0-3                                 0/1     Completed   0          64m
a42x4dp5xpgphrvhc6jb-n0-0-raycluster-lwh8r-head-pnbdl     1/1     Running     0          4m29s
gphrvhc6jb-n0-0-raycluster-lwh8r-worker-ray-group-2fzxq   0/1     Error       0          4m29s
gphrvhc6jb-n0-0-raycluster-lwh8r-worker-ray-group-k8dbr   0/1     Error       0          3m54s
gphrvhc6jb-n0-0-raycluster-lwh8r-worker-ray-group-vh5j5   0/1     Error       0          3m37s
gphrvhc6jb-n0-0-raycluster-lwh8r-worker-ray-group-phdqn   0/1     Error       0          3m19s
gphrvhc6jb-n0-0-raycluster-lwh8r-worker-ray-group-7wrl2   0/1     Error       0          3m1s
gphrvhc6jb-n0-0-raycluster-lwh8r-worker-ray-group-bcnxb   0/1     Error       0          2m44s
gphrvhc6jb-n0-0-raycluster-lwh8r-worker-ray-group-tf7xg   0/1     Error       0          2m26s
gphrvhc6jb-n0-0-raycluster-lwh8r-worker-ray-group-mvchf   0/1     Error       0          2m9s
gphrvhc6jb-n0-0-raycluster-lwh8r-worker-ray-group-hsnls   0/1     Error       0          111s
gphrvhc6jb-n0-0-raycluster-lwh8r-worker-ray-group-g4jc8   0/1     Error       0          94s
gphrvhc6jb-n0-0-raycluster-lwh8r-worker-ray-group-44nkj   0/1     Error       0          77s
gphrvhc6jb-n0-0-raycluster-lwh8r-worker-ray-group-snv59   0/1     Error       0          59s
gphrvhc6jb-n0-0-raycluster-lwh8r-worker-ray-group-6p5dj   0/1     Error       0          41s
gphrvhc6jb-n0-0-raycluster-lwh8r-worker-ray-group-2tzfd   0/1     Error       0          24s
gphrvhc6jb-n0-0-raycluster-lwh8r-worker-ray-group-lkw82   1/1     Running     0          7s
export KUBERAY_VERSION=v0.5.2
kubectl create -k "github.com/ray-project/kuberay/ray-operator/config/default?ref=${KUBERAY_VERSION}"
```

### Are you sure this issue hasn't been raised already?

  • Yes

### Have you read the Code of Conduct?

  • Yes

@pingsutw added the bug, untriaged, plugins, Epic: Ray, and good first issue labels and removed the untriaged label on Sep 12, 2023

@Future-Outlier
Member

I am interested!
If possible, please give me a chance!

@ashahab

ashahab commented Sep 15, 2023

The issue is potentially here: https://github.com/flyteorg/flytekit/blob/master/flytekit/core/python_function_task.py#L101
In many cases, the ConfigMap that specifies the plugin does not seem to be taken into account. Maybe it's only loaded at startup?

Without a reload of the config, the task would be treated as a Python task rather than a Ray task.

@ashahab

ashahab commented Sep 15, 2023

OK, I have worked around the issue by locating the Flyte pod and killing it. When the Deployment restarts it, the new pod picks up the new config. The bug, therefore, is that the demo cluster does not watch for ConfigMap changes and reload them.
Now my flyte sandbox/demo cluster is creating ray jobs and ray clusters:

k get pods -n flyte
NAME                                                  READY   STATUS    RESTARTS   AGE
flyte-sandbox-proxy-d95874857-4mxmg                   1/1     Running   0          59m
flyte-sandbox-docker-registry-764bf7c89f-rhs29        1/1     Running   0          59m
flyte-sandbox-kubernetes-dashboard-6757db879c-j922w   1/1     Running   0          59m
flyte-sandbox-buildkit-7d7d55dbb-s8bhd                1/1     Running   0          59m
flyte-sandbox-postgresql-0                            1/1     Running   0          59m
flyte-sandbox-minio-645c8ddf7c-vb2rx                  1/1     Running   0          59m
flyte-sandbox-7d699df5fc-g6wsp                        1/1     Running   0          6m37s

Run kubectl delete on the pod whose name is just "flyte-sandbox-" followed by the generated suffixes (flyte-sandbox-7d699df5fc-g6wsp in the listing above), and you will see the change.
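
For anyone scripting this workaround, here is a minimal sketch of how the right pod could be picked out of the listing. It assumes the usual Deployment pod naming (`<deployment>-<replicaset-hash>-<5-character suffix>`) and that the sandbox Deployment is called `flyte-sandbox`, which is inferred from the pod names above rather than confirmed. `find_sandbox_pod` and `delete_command` are hypothetical helper names; the sketch only builds the `kubectl delete pod` command and does not run it.

```python
import re
from typing import List, Optional

# Assumption: pods owned by the "flyte-sandbox" Deployment are named
# flyte-sandbox-<replicaset-hash>-<5-char suffix>. Sibling pods such as
# flyte-sandbox-proxy-... carry an extra name component and do not match.
SANDBOX_POD = re.compile(r"^flyte-sandbox-[a-z0-9]+-[a-z0-9]{5}$")


def find_sandbox_pod(pod_names: List[str]) -> Optional[str]:
    """Return the first pod name that looks like the main sandbox pod."""
    for name in pod_names:
        if SANDBOX_POD.match(name):
            return name
    return None


def delete_command(pod_name: str, namespace: str = "flyte") -> List[str]:
    """Build (but do not run) the kubectl command that deletes the pod."""
    return ["kubectl", "delete", "pod", pod_name, "-n", namespace]


# Pod names copied from the `k get pods -n flyte` listing above.
PODS = [
    "flyte-sandbox-proxy-d95874857-4mxmg",
    "flyte-sandbox-docker-registry-764bf7c89f-rhs29",
    "flyte-sandbox-kubernetes-dashboard-6757db879c-j922w",
    "flyte-sandbox-buildkit-7d7d55dbb-s8bhd",
    "flyte-sandbox-postgresql-0",
    "flyte-sandbox-minio-645c8ddf7c-vb2rx",
    "flyte-sandbox-7d699df5fc-g6wsp",
]

if __name__ == "__main__":
    pod = find_sandbox_pod(PODS)
    if pod is not None:
        print(" ".join(delete_command(pod)))
```

The resulting list can be passed to `subprocess.run` if you want the script to perform the deletion itself.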

kc get rayclusters --all-namespaces
NAMESPACE                 NAME                                         DESIRED WORKERS   AVAILABLE WORKERS   STATUS   AGE
flytesnacks-development   f741cb60938e6411cbe3-n0-0-raycluster-cpn9q   2                 2                   ready    5m36s
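
A rough sketch of how the listing above could be checked programmatically, assuming the column layout shown (columns separated by two or more spaces). `parse_table` and `cluster_ready` are hypothetical helpers, not part of Flyte or KubeRay. Note that this only tells you the RayCluster has its workers; it says nothing about whether the RayJob itself finished.

```python
import re
from typing import Dict, List


def parse_table(output: str) -> List[Dict[str, str]]:
    """Parse kubectl-style table output into one dict per row.

    Columns are split on runs of two or more spaces so that headers
    containing a single space (e.g. "DESIRED WORKERS") stay intact.
    """
    lines = [line for line in output.strip().splitlines() if line.strip()]
    header = re.split(r"\s{2,}", lines[0].strip())
    rows = []
    for line in lines[1:]:
        cells = re.split(r"\s{2,}", line.strip())
        rows.append(dict(zip(header, cells)))
    return rows


def cluster_ready(row: Dict[str, str]) -> bool:
    """True if all desired workers are available and the status is 'ready'."""
    return (
        row["DESIRED WORKERS"] == row["AVAILABLE WORKERS"]
        and row["STATUS"] == "ready"
    )


# Output copied from the `kc get rayclusters --all-namespaces` listing above.
SAMPLE = """
NAMESPACE                 NAME                                         DESIRED WORKERS   AVAILABLE WORKERS   STATUS   AGE
flytesnacks-development   f741cb60938e6411cbe3-n0-0-raycluster-cpn9q   2                 2                   ready    5m36s
"""

if __name__ == "__main__":
    for row in parse_table(SAMPLE):
        print(row["NAME"], "ready" if cluster_ready(row) else "not ready")
```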

@pingsutw
Member Author

@ashahab Did the RayJob complete without errors?

@bhattarai842

When running in the EKS cluster, do you see the head and worker nodes being created? Even though I followed the instructions in the example, for me it always ran in local mode in the EKS cluster.

@pingsutw
Member Author

For those who run into the same issue: the job failed because I'm running the sandbox on an M2. Ray doesn't have a multi-arch image, so you have to build an ARM image yourself.
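
A quick way to tell whether you are likely to hit this is to check the host architecture before launching the task. The sketch below is only an illustration: `needs_arm_image` is a hypothetical helper, and it assumes the sandbox containers run with the same architecture as the host, which is the common setup on Apple Silicon.

```python
import platform

# Machine names reported for 64-bit ARM hosts:
# macOS reports "arm64", Linux reports "aarch64".
ARM64_NAMES = {"arm64", "aarch64"}


def needs_arm_image(machine: str) -> bool:
    """Return True if the machine string denotes a 64-bit ARM host."""
    return machine.lower() in ARM64_NAMES


if __name__ == "__main__":
    host = platform.machine()
    if needs_arm_image(host):
        print(f"{host}: ARM host, an ARM build of the Ray image is needed")
    else:
        print(f"{host}: not a 64-bit ARM host")
```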

@pingsutw pingsutw self-assigned this Sep 26, 2023