[Core] ray job submit doesn't always catch the last lines of the job logs #48701

Open · kpouget opened this issue Nov 12, 2024 · 2 comments · May be fixed by #49337

Labels: bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), P1 (Issue that should be fixed within a few weeks)

Comments


kpouget commented Nov 12, 2024

What happened + What you expected to happen

When I launch Ray jobs as part of OpenShift AI (RayJobs with submissionMode: K8sJobMode), I observe that the end of the job's logs isn't always captured correctly.

The submit command (part of the Kubernetes Job created from the RayJob) is the following:

        - ray
        - job
        - submit
        - --address
        - http://rayjob-sample-raycluster-25q9n-head-svc.topsail.svc.cluster.local:8265
        - --runtime-env-json
        - '{"pip":[]}'
        - --submission-id
        - rayjob-sample-2zcmx
        - --
        - bash
        - /home/ray/samples/entrypoint.sh
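
For reference, the same submission expressed through Ray's Python Job Submission SDK would look roughly like the sketch below (a sketch only, not the actual submitter code); the point is how the CLI flags --address, --runtime-env-json and --submission-id map to SDK arguments.

    # Sketch only: the SDK-side equivalent of the submitter command above.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(
        "http://rayjob-sample-raycluster-25q9n-head-svc.topsail.svc.cluster.local:8265"
    )
    client.submit_job(
        entrypoint="bash /home/ray/samples/entrypoint.sh",
        runtime_env={"pip": []},              # --runtime-env-json '{"pip":[]}'
        submission_id="rayjob-sample-2zcmx",  # --submission-id
    )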

Sometimes, the logs of this Pod do not contain the last lines printed by my entrypoint.sh script:

oc logs rayjob-sample-9r7vm | tail -15
│ function_trainable_17836_00193   TERMINATED   0.26546   │
│ function_trainable_17836_00194   TERMINATED   0.268351  │
│ function_trainable_17836_00195   TERMINATED   0.971191  │
│ function_trainable_17836_00196   TERMINATED   0.683966  │
│ function_trainable_17836_00197   TERMINATED   0.509735  │
│ function_trainable_17836_00198   TERMINATED   0.414847  │
│ function_trainable_17836_00199   TERMINATED   0.949224  │
╰─────────────────────────────────────────────────────────╯

The result network overhead test took 6.04 seconds, which is below the budget of 500.00 seconds. Test successful.

--- PASSED: RESULT NETWORK OVERHEAD ::: 6.04 <= 500.00 ---
2024-11-12 14:55:08,840	SUCC cli.py:63 -- -----------------------------------
2024-11-12 14:55:08,840	SUCC cli.py:64 -- Job 'rayjob-sample-2zcmx' succeeded
2024-11-12 14:55:08,841	SUCC cli.py:65 -- -----------------------------------

However, if I oc rsh into Ray's head Pod, I see that the output is correctly captured:

(app-root) sh-5.1$ ray job logs rayjob-sample-2zcmx | tail -10
│ function_trainable_17836_00197   TERMINATED   0.509735  │
│ function_trainable_17836_00198   TERMINATED   0.414847  │
│ function_trainable_17836_00199   TERMINATED   0.949224  │
╰─────────────────────────────────────────────────────────╯

The result network overhead test took 6.04 seconds, which is below the budget of 500.00 seconds. Test successful.

--- PASSED: RESULT NETWORK OVERHEAD ::: 6.04 <= 500.00 ---
+ echo 'SCRIPT SUCCEEDED'
SCRIPT SUCCEEDED

This issue sits at the boundary between Ray and KubeRay, but I think it should be reproducible outside of the K8s environment, so I chose to file the issue in this repository.
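
Since ray job logs on the head node shows the full output while the submitter's stream is truncated, the gap may be on the client-side log streaming path. One way to probe this outside of Kubernetes (a sketch only; the dashboard address and submission id below are placeholders) is to tail the same job's logs through the SDK and compare the result with the stored copy once the job finishes:

    # Sketch only: tail a job's logs through the SDK the way `ray job submit`
    # does, then compare against the copy stored on the head node.
    # The address and submission id are placeholders.
    import asyncio

    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient("http://127.0.0.1:8265")  # placeholder address
    submission_id = "rayjob-sample-2zcmx"                  # placeholder id

    async def stream_logs() -> str:
        chunks = []
        # tail_job_logs streams until the job reaches a terminal state.
        async for chunk in client.tail_job_logs(submission_id):
            chunks.append(chunk)
        return "".join(chunks)

    streamed = asyncio.run(stream_logs())
    stored = client.get_job_logs(submission_id)

    last_line = stored.rstrip().splitlines()[-1] if stored.strip() else ""
    if last_line and last_line not in streamed:
        print(f"Streamed output is missing the final line: {last_line!r}")
    else:
        print("Streamed output contains the final stored line.")

If the streamed copy occasionally lacks the final stored line, that would match the behavior observed in the submitter Pod logs above.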

Versions / Dependencies

2.35.0
quay.io/rhoai/ray:2.35.0-py311-cu121-torch24-fa26

Reproduction script

Sample job (ray-job-sample.yaml)

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  submissionMode: "K8sJobMode"
  entrypoint: bash /home/ray/samples/entrypoint.sh

  runtimeEnvYAML: |
    pip: []

  rayClusterSpec:
    rayVersion: '2.35.0' # should match the Ray version in the image of the containers
    headGroupSpec:
      rayStartParams:
        dashboard-host: '0.0.0.0'
      #pod template
      template:
        spec:
          containers:
            - name: ray-head
              image: quay.io/rhoai/ray:2.35.0-py311-cu121-torch24-fa26
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
              resources:
                limits:
                  cpu: "1"
                requests:
                  cpu: "200m"
              volumeMounts:
                - mountPath: /home/ray/samples
                  name: code-sample
          volumes:
            # You set volumes at the Pod level, then mount them into containers inside that Pod
            - name: code-sample
              configMap:
                # Provide the name of the ConfigMap you want to mount.
                name: ray-job-code-sample
                # An array of keys from the ConfigMap to create as files
                items:
                  - key: test_network_overhead.py
                    path: test_network_overhead.py
                  - key: entrypoint.sh
                    path: entrypoint.sh
    workerGroupSpecs:
      # the pod replicas in this group typed worker
      - replicas: 4
        minReplicas: 1
        maxReplicas: 5
        # logical group name, for this called small-group, also can be functional
        groupName: small-group
        rayStartParams: {}
        #pod template
        template:
          spec:
            containers:
              - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
                image: quay.io/rhoai/ray:2.35.0-py311-cu121-torch24-fa26
                lifecycle:
                  preStop:
                    exec:
                      command: [ "/bin/sh","-c","ray stop" ]
                resources:
                  limits:
                    cpu: "2"
                  requests:
                    cpu: "200m"

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-job-code-sample
data:
  entrypoint.sh: |
    set -o pipefail;
    set -o errexit;
    set -o nounset;
    set -o errtrace;
    set -x;

    if python /home/ray/samples/test_network_overhead.py ; then
        echo "SCRIPT SUCCEEDED";
    else
        echo "SCRIPT FAILED";
        # don't exit with a return code != 0, otherwise the RayJob->Job retries 3 times ...
    fi

  test_network_overhead.py: |
    import os
    import json

    import ray

    from ray.tune.utils.release_test_util import timed_tune_run

    def main():
        ray.init(address="auto")

        num_samples = 200

        results_per_second = 0.01
        trial_length_s = 10

        max_runtime = 500

        success = timed_tune_run(
            name="result network overhead",
            num_samples=num_samples,
            results_per_second=results_per_second,
            trial_length_s=trial_length_s,
            max_runtime=max_runtime,
            # One trial per worker node, none get scheduled on the head node.
            # See the compute config.
            resources_per_trial={"cpu": 2},
        )


    if __name__ == "__main__":
        main()

Sample launcher:

#!/bin/bash

set -o pipefail
set -o errexit
set -o nounset
set -o errtrace
set -x

try_count=0
while true; do
    try_count=$((try_count+1))
    echo "Try #$try_count"
    oc delete -f ray-job-sample.yaml --ignore-not-found
    # ensure that the job is gone
    oc delete jobs/rayjob-sample --ignore-not-found
    oc apply -f ray-job-sample.yaml

    set +x
    echo "Waiting for the job to appear ..."

    while ! oc get job/rayjob-sample -oname 2>/dev/null; do
        sleep 1;
    done

    echo "Waiting for the job to Complete ..."
    oc wait --for=condition=Complete job/rayjob-sample --timeout=900s

    echo "Checking the job logs ..."
    if ! oc logs job/rayjob-sample | grep -E 'SCRIPT SUCCEEDED|SCRIPT FAILED'; then
        echo "Termination message missing at try #{try_count}!"
        oc logs job/rayjob-sample | tail -25
        exit 1
    fi
done

Issue Severity

None

@kpouget kpouget added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Nov 12, 2024
@rynewang rynewang added P1 Issue that should be fixed within a few weeks core Issues that should be addressed in Ray Core and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Nov 12, 2024
@MortalHappiness MortalHappiness added the kuberay Issues for the Ray/Kuberay integration that are tracked on the Ray side label Nov 13, 2024
MortalHappiness (Member) commented:

Note to self: oc is equivalent to kubectl.

MortalHappiness (Member) commented:

Reproduction

Here is a simpler reproduction script.

task.py

import ray

ray.init(address="auto")

@ray.remote
def f():
    for i in range(1000000):
        print(f"Hello world: {i}")

ray.get([f.remote()])
Then run:

ray start --head --include-dashboard=True
ray job submit --working-dir . -- python task.py
cd /tmp/ray/session_latest/logs
grep -i 'hello world: 0' *.out *.log

And then tail these 2 files

[screenshot of the tail output]
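
To confirm that the head node's record is complete independently of what the client printed, the stored logs can also be read back through the SDK. This is only a sketch; the submission id is a placeholder for the raysubmit_... id printed by ray job submit.

    # Sketch: read the job logs back from the head node and inspect the last line.
    # "raysubmit_XXXXXXXX" is a placeholder for the real submission id.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient("http://127.0.0.1:8265")
    logs = client.get_job_logs("raysubmit_XXXXXXXX")
    # Expected to end with "Hello world: 999999" if nothing was dropped.
    print(logs.rstrip().splitlines()[-1])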

@MortalHappiness MortalHappiness removed the kuberay Issues for the Ray/Kuberay integration that are tracked on the Ray side label Dec 13, 2024
MortalHappiness added commits to MortalHappiness/ray that referenced this issue on Dec 18 and Dec 19, 2024