[Core] `ray job submit` doesn't always catch the last lines of the job logs #48701

kpouget · 2024-11-12T15:22:44Z

What happened + What you expected to happen

When I launch Ray jobs as part of OpenShift AI (RayJobs with submissionMode set to K8sJobMode), I observe that the last lines of the job logs aren't always captured.

The submit command (part of the Job created out of the RayJob) is the following:

        - ray
        - job
        - submit
        - --address
        - http://rayjob-sample-raycluster-25q9n-head-svc.topsail.svc.cluster.local:8265
        - --runtime-env-json
        - '{"pip":[]}'
        - --submission-id
        - rayjob-sample-2zcmx
        - --
        - bash
        - /home/ray/samples/entrypoint.sh

Sometimes the logs of this Pod are missing the last lines printed by my entrypoint.sh script:

oc logs rayjob-sample-9r7vm | tail -15
│ function_trainable_17836_00193   TERMINATED   0.26546   │
│ function_trainable_17836_00194   TERMINATED   0.268351  │
│ function_trainable_17836_00195   TERMINATED   0.971191  │
│ function_trainable_17836_00196   TERMINATED   0.683966  │
│ function_trainable_17836_00197   TERMINATED   0.509735  │
│ function_trainable_17836_00198   TERMINATED   0.414847  │
│ function_trainable_17836_00199   TERMINATED   0.949224  │
╰─────────────────────────────────────────────────────────╯

The result network overhead test took 6.04 seconds, which is below the budget of 500.00 seconds. Test successful.

--- PASSED: RESULT NETWORK OVERHEAD ::: 6.04 <= 500.00 ---
2024-11-12 14:55:08,840	SUCC cli.py:63 -- -----------------------------------
2024-11-12 14:55:08,840	SUCC cli.py:64 -- Job 'rayjob-sample-2zcmx' succeeded
2024-11-12 14:55:08,841	SUCC cli.py:65 -- -----------------------------------

However, if I rsh into the Ray head Pod, I see that the end of the logs is correctly captured there:

(app-root) sh-5.1$ ray job logs rayjob-sample-2zcmx | tail -10
│ function_trainable_17836_00197   TERMINATED   0.509735  │
│ function_trainable_17836_00198   TERMINATED   0.414847  │
│ function_trainable_17836_00199   TERMINATED   0.949224  │
╰─────────────────────────────────────────────────────────╯

The result network overhead test took 6.04 seconds, which is below the budget of 500.00 seconds. Test successful.

--- PASSED: RESULT NETWORK OVERHEAD ::: 6.04 <= 500.00 ---
+ echo 'SCRIPT SUCCEEDED'
SCRIPT SUCCEEDED

This issue is at the boundary between Ray and KubeRay, but I think it should be reproducible outside of a K8s environment, so I chose to file the issue in this repository.
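To make the comparison between the two outputs above less manual, here is a minimal sketch in plain Python. It assumes both outputs have already been saved as text (for example from `oc logs` and `ray job logs`); the function name `missing_tail` is hypothetical and not part of Ray.

```python
def missing_tail(captured: str, full: str) -> list[str]:
    """Return the non-blank lines at the end of `full` that `captured` lacks.

    `captured` is the output seen by the submitter (it may contain extra
    lines of its own, such as the CLI status banner); `full` is the
    complete job log.
    """
    captured_lines = {line for line in captured.splitlines() if line.strip()}
    full_lines = full.splitlines()

    # Walk backwards to the last line of the full log that also shows up
    # in the captured output; everything after it was lost.
    for index in range(len(full_lines) - 1, -1, -1):
        if full_lines[index] in captured_lines:
            return [line for line in full_lines[index + 1:] if line.strip()]

    # No line in common: treat the whole full log as missing.
    return [line for line in full_lines if line.strip()]
```

With the two excerpts from this report, the function would return the `+ echo 'SCRIPT SUCCEEDED'` and `SCRIPT SUCCEEDED` lines.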

Versions / Dependencies

2.35.0
quay.io/rhoai/ray:2.35.0-py311-cu121-torch24-fa26

Reproduction script

Sample job (ray-job.sample.yaml)

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  submissionMode: "K8sJobMode"
  entrypoint: bash /home/ray/samples/entrypoint.sh

  runtimeEnvYAML: |
    pip: []

  rayClusterSpec:
    rayVersion: '2.35.0' # should match the Ray version in the image of the containers
    headGroupSpec:
      rayStartParams:
        dashboard-host: '0.0.0.0'
      #pod template
      template:
        spec:
          containers:
            - name: ray-head
              image: quay.io/rhoai/ray:2.35.0-py311-cu121-torch24-fa26
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
              resources:
                limits:
                  cpu: "1"
                requests:
                  cpu: "200m"
              volumeMounts:
                - mountPath: /home/ray/samples
                  name: code-sample
          volumes:
            # You set volumes at the Pod level, then mount them into containers inside that Pod
            - name: code-sample
              configMap:
                # Provide the name of the ConfigMap you want to mount.
                name: ray-job-code-sample
                # An array of keys from the ConfigMap to create as files
                items:
                  - key: test_network_overhead.py
                    path: test_network_overhead.py
                  - key: entrypoint.sh
                    path: entrypoint.sh
    workerGroupSpecs:
      # the pod replicas in this group typed worker
      - replicas: 4
        minReplicas: 1
        maxReplicas: 5
        # logical group name, for this called small-group, also can be functional
        groupName: small-group
        rayStartParams: {}
        #pod template
        template:
          spec:
            containers:
              - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
                image: quay.io/rhoai/ray:2.35.0-py311-cu121-torch24-fa26
                lifecycle:
                  preStop:
                    exec:
                      command: [ "/bin/sh","-c","ray stop" ]
                resources:
                  limits:
                    cpu: "2"
                  requests:
                    cpu: "200m"

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-job-code-sample
data:
  entrypoint.sh: |
    set -o pipefail;
    set -o errexit;
    set -o nounset;
    set -o errtrace;
    set -x;

    if python /home/ray/samples/test_network_overhead.py ; then
        echo "SCRIPT SUCCEEDED";
    else
        echo "SCRIPT FAILED";
        # don't exit with a return code != 0, otherwise the RayJob->Job retries 3 times ...
    fi

  test_network_overhead.py: |
    import os
    import json

    import ray

    from ray.tune.utils.release_test_util import timed_tune_run

    def main():
        ray.init(address="auto")

        num_samples = 200

        results_per_second = 0.01
        trial_length_s = 10

        max_runtime = 500

        success = timed_tune_run(
            name="result network overhead",
            num_samples=num_samples,
            results_per_second=results_per_second,
            trial_length_s=trial_length_s,
            max_runtime=max_runtime,
            # One trial per worker node, none get scheduled on the head node.
            # See the compute config.
            resources_per_trial={"cpu": 2},
        )


    if __name__ == "__main__":
        main()

Sample launcher:

#!/bin/bash

set -o pipefail
set -o errexit
set -o nounset
set -o errtrace
set -x

try_count=0
while true; do
    try_count=$((try_count+1))
    echo "Try #$try_count"
    oc delete -f ray-job.sample.yaml --ignore-not-found
    # ensure that the job is gone
    oc delete jobs/rayjob-sample --ignore-not-found
    oc apply -f ray-job.sample.yaml

    set +x
    echo "Waiting for the job to appear ..."

    while  ! oc get job/rayjob-sample -oname 2>/dev/null; do
        sleep 1;
    done

    echo "Waiting for the job to Complete ..."
    oc wait --for=condition=Complete job/rayjob-sample --timeout=900s

    echo "Checking the job logs ..."
    if ! oc logs job/rayjob-sample | grep -E 'SCRIPT SUCCEEDED|SCRIPT FAILED'; then
        echo "Termination message missing at try #${try_count}!"
        oc logs job/rayjob-sample | tail -25
        exit 1
    fi
done

Issue Severity

None


MortalHappiness · 2024-11-13T22:14:49Z

Note to self: oc is equivalent to kubectl

MortalHappiness · 2024-12-13T14:06:48Z

Reproduction

Here is a simpler reproduction script.

task.py

import ray

ray.init(address="auto")

@ray.remote
def f():
    for i in range(1000000):
        print(f"Hello world: {i}")

ray.get([f.remote()])

ray start --head --include-dashboard=True
ray job submit --working-dir . -- python task.py
cd /tmp/ray/session_latest/logs
grep -i 'hello world: 0' *.out *.log

Then tail the two files that grep reports and compare their last lines.
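The manual grep-and-tail step can be sketched as a small Python helper. This is only an illustration: it assumes the log directory from the commands above and the `Hello world: N` format printed by task.py, and the function names are hypothetical.

```python
import re
from pathlib import Path

# Matches the lines printed by task.py, even when a log prefix precedes them.
HELLO = re.compile(r"Hello world: (\d+)")


def last_hello_index(path):
    """Return the index of the last 'Hello world: N' line in a file, or None."""
    last = None
    with open(path, "r", errors="replace") as handle:
        for line in handle:
            match = HELLO.search(line)
            if match:
                last = int(match.group(1))
    return last


def compare_logs(log_dir):
    """Map each *.out / *.log file that contains the message to its last index."""
    results = {}
    for path in sorted(Path(log_dir).iterdir()):
        if path.is_file() and path.suffix in {".out", ".log"}:
            index = last_hello_index(path)
            if index is not None:
                results[path.name] = index
    return results


if __name__ == "__main__":
    # Directory taken from the reproduction commands above.
    for name, index in compare_logs("/tmp/ray/session_latest/logs").items():
        print(f"{name}: last index {index}")
```

Since task.py prints indices 0 through 999999, a file whose last index is lower than 999999 stopped receiving output early.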


kpouget added the bug (Something that is supposed to be working; but isn't) and triage (Needs triage, e.g. priority, bug/not-bug, and owning component) labels Nov 12, 2024

rynewang assigned MortalHappiness Nov 12, 2024

rynewang added the P1 (Issue that should be fixed within a few weeks) and core (Issues that should be addressed in Ray Core) labels and removed the triage label Nov 12, 2024

MortalHappiness added the kuberay (Issues for the Ray/Kuberay integration that are tracked on the Ray side) label Nov 13, 2024

MortalHappiness removed the kuberay label Dec 13, 2024

MortalHappiness added a commit to MortalHappiness/ray that referenced this issue Dec 18, 2024

[Fix][Core] Wait a while before stopping the ray_print_logs thread to prevent pending logs (150b116)

Closes: ray-project#48701 Signed-off-by: Chi-Sheng Liu

MortalHappiness linked a pull request Dec 18, 2024 that will close this issue

[Fix][Core] Wait a while before stopping the ray_print_logs thread to prevent pending logs #49337

Draft

8 tasks
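The title of the linked pull request suggests that lines are lost when the thread that prints logs is stopped while output is still pending. The following is a generic sketch of that kind of race, not Ray's implementation: all names are hypothetical, and it drains the queue with `queue.Queue.join()` instead of waiting for a fixed time as the pull request title describes.

```python
import queue
import threading


def run_printer(lines, drain_before_stop):
    """Feed `lines` to a background printer thread and return what it printed."""
    pending = queue.Queue()
    printed = []
    stop = threading.Event()

    def printer():
        # Consume lines until asked to stop; anything still queued at
        # that point is never printed.
        while not stop.is_set():
            try:
                line = pending.get(timeout=0.01)
            except queue.Empty:
                continue
            printed.append(line)
            pending.task_done()

    thread = threading.Thread(target=printer, daemon=True)
    thread.start()

    for line in lines:
        pending.put(line)

    if drain_before_stop:
        # Block until every queued line has been handled by the printer.
        pending.join()

    stop.set()
    thread.join()
    return printed
```

Calling `run_printer(lines, drain_before_stop=False)` may return only a prefix of `lines`, which mirrors the truncated output in this report, while `drain_before_stop=True` returns every line.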

MortalHappiness added seven more commits to MortalHappiness/ray that referenced this issue, each with the same title and message as the commit above:

f42a164, 2586dc1, 9594db8, 57cdb10 (Dec 18, 2024)
d33b6d3, 90fb1a6, 9e4fb06 (Dec 19, 2024)
