Failing to get started with tutorial on GCP #3900

Closed
kesitrifork opened this issue Aug 31, 2024 · 8 comments · Fixed by #3906

Comments

@kesitrifork

I tried the getting started tutorial on Google Cloud and it failed both the first and second time, outputting this whole file as the error:

$ sky launch -c mycluster hello_sky.yaml
Task from YAML spec: hello_sky.yaml
Running task on cluster mycluster...
I 08-31 21:01:59 cloud_vm_ray_backend.py:1313] To view detailed progress: tail -n100 -f /Users/kevinsimper/sky_logs/sky-2024-08-31-21-01-58-553026/provision.log
I 08-31 21:02:01 provisioner.py:65] Launching on GCP us-central1 (us-central1-a)
I 08-31 21:02:20 provisioner.py:448] Successfully provisioned or found existing instance.
I 08-31 21:02:39 provisioner.py:550] Successfully provisioned cluster: mycluster
I 08-31 21:02:39 cloud_vm_ray_backend.py:3050] Syncing workdir (to 1 node): . -> ~/sky_workdir
I 08-31 21:02:39 cloud_vm_ray_backend.py:3058] To view detailed progress: tail -n100 -f ~/sky_logs/sky-2024-08-31-21-01-58-553026/workdir_sync.log
I 08-31 21:02:40 cloud_vm_ray_backend.py:3162] Running setup on 1 node.
bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
bash: no job control in this shell
bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
Running setup.
I 08-31 21:02:42 cloud_vm_ray_backend.py:3175] Setup completed.
E 08-31 21:02:44 subprocess_utils.py:84] mm_send_fd: sendmsg(2): Message too long
E 08-31 21:02:44 subprocess_utils.py:84] mux_client_request_session: send fds failed
E 08-31 21:02:44 subprocess_utils.py:84]
I 08-31 21:02:44 cloud_vm_ray_backend.py:3393]
I 08-31 21:02:44 cloud_vm_ray_backend.py:3393] Cluster name: mycluster
I 08-31 21:02:44 cloud_vm_ray_backend.py:3393] To log into the head VM:	ssh mycluster
I 08-31 21:02:44 cloud_vm_ray_backend.py:3393] To submit a job:		sky exec mycluster yaml_file
I 08-31 21:02:44 cloud_vm_ray_backend.py:3393] To stop the cluster:	sky stop mycluster
I 08-31 21:02:44 cloud_vm_ray_backend.py:3393] To teardown the cluster:	sky down mycluster
Clusters
NAME       LAUNCHED        RESOURCES              STATUS  AUTOSTOP  COMMAND
mycluster  a few secs ago  1x GCP(n2-standard-8)  UP      -         sky launch -c mycluster h...

sky.exceptions.CommandError: Command cd ~/sky_workdir && mkdir -p ~/sky_logs/sky-2024-08-31-21-01-58-553026 && touch ~/sky_logs/sky-2024-08-31-21-01-58-553026/run.log && { echo 'import getpass
import hashlib
import io
import os
import pathlib
import selectors
import shlex
import subprocess
import sys
import tempfile
import textwrap
import time
from typing import Dict, List, Optional, Tuple, Union

import ray
import ray.util as ray_util

from sky.skylet import autostop_lib
from sky.skylet import constants
from sky.skylet import job_lib
from sky.utils import log_utils

SKY_REMOTE_WORKDIR = '"'"'~/sky_workdir'"'"'

kwargs = dict()
# Only set the `_temp_dir` to SkyPilot'"'"'s ray cluster directory when
# the directory exists for backward compatibility for the VM
# launched before #1790.
if os.path.exists('"'"'/tmp/ray_skypilot'"'"'):
    kwargs['"'"'_temp_dir'"'"'] = '"'"'/tmp/ray_skypilot'"'"'
ray.init(
    address='"'"'auto'"'"',
    namespace='"'"'__sky__2__'"'"',
    log_to_driver=True,
    **kwargs
)
def get_or_fail(futures, pg) -> List[int]:
    """Wait for tasks, if any fails, cancel all unready."""
    returncodes = [1] * len(futures)
    # Wait for 1 task to be ready.
    ready = []
    # Keep invoking ray.wait if ready is empty. This is because
    # ray.wait with timeout=None will only wait for 10**6 seconds,
    # which will cause tasks running for more than 12 days to return
    # before becoming ready.
    # (Such tasks are common in serving jobs.)
    # Reference: https://github.com/ray-project/ray/blob/ray-2.9.3/python/ray/_private/worker.py#L2845-L2846
    while not ready:
        ready, unready = ray.wait(futures)
    idx = futures.index(ready[0])
    returncodes[idx] = ray.get(ready[0])
    while unready:
        if returncodes[idx] != 0:
            for task in unready:
                # ray.cancel without force fails to kill tasks.
                # We use force=True to kill unready tasks.
                ray.cancel(task, force=True)
                # Use SIGKILL=128+9 to indicate the task is forcely
                # killed.
                idx = futures.index(task)
                returncodes[idx] = 137
            break
        ready, unready = ray.wait(unready)
        idx = futures.index(ready[0])
        returncodes[idx] = ray.get(ready[0])
    # Remove the placement group after all tasks are done, so that
    # the next job can be scheduled on the released resources
    # immediately.
    ray_util.remove_placement_group(pg)
    sys.stdout.flush()
    return returncodes

run_fn = None
futures = []

class _ProcessingArgs:
    """Arguments for processing logs."""

    def __init__(self,
                 log_path: str,
                 stream_logs: bool,
                 start_streaming_at: str = '"'"''"'"',
                 end_streaming_at: Optional[str] = None,
                 skip_lines: Optional[List[str]] = None,
                 replace_crlf: bool = False,
                 line_processor: Optional[log_utils.LineProcessor] = None,
                 streaming_prefix: Optional[str] = None) -> None:
        self.log_path = log_path
        self.stream_logs = stream_logs
        self.start_streaming_at = start_streaming_at
        self.end_streaming_at = end_streaming_at
        self.skip_lines = skip_lines
        self.replace_crlf = replace_crlf
        self.line_processor = line_processor
        self.streaming_prefix = streaming_prefix

def _handle_io_stream(io_stream, out_stream, args: _ProcessingArgs):
    """Process the stream of a process."""
    out_io = io.TextIOWrapper(io_stream,
                              encoding='"'"'utf-8'"'"',
                              newline='"'"''"'"',
                              errors='"'"'replace'"'"',
                              write_through=True)

    start_streaming_flag = False
    end_streaming_flag = False
    streaming_prefix = args.streaming_prefix if args.streaming_prefix else '"'"''"'"'
    line_processor = (log_utils.LineProcessor()
                      if args.line_processor is None else args.line_processor)

    out = []
    with open(args.log_path, '"'"'a'"'"', encoding='"'"'utf-8'"'"') as fout:
        with line_processor:
            while True:
                line = out_io.readline()
                if not line:
                    break
                # start_streaming_at logic in processor.process_line(line)
                if args.replace_crlf and line.endswith('"'"'\r\n'"'"'):
                    # Replace CRLF with LF to avoid ray logging to the same
                    # line due to separating lines with '"'"'\n'"'"'.
                    line = line[:-2] + '"'"'\n'"'"'
                if (args.skip_lines is not None and
                        any(skip in line for skip in args.skip_lines)):
                    continue
                if args.start_streaming_at in line:
                    start_streaming_flag = True
                if (args.end_streaming_at is not None and
                        args.end_streaming_at in line):
                    # Keep executing the loop, only stop streaming.
                    # E.g., this is used for `sky bench` to hide the
                    # redundant messages of `sky launch` while
                    # saving them in log files.
                    end_streaming_flag = True
                if (args.stream_logs and start_streaming_flag and
                        not end_streaming_flag):
                    print(streaming_prefix + line,
                          end='"'"''"'"',
                          file=out_stream,
                          flush=True)
                if args.log_path != '"'"'/dev/null'"'"':
                    fout.write(line)
                    fout.flush()
                line_processor.process_line(line)
                out.append(line)
    return '"'"''"'"'.join(out)

def process_subprocess_stream(proc, args: _ProcessingArgs) -> Tuple[str, str]:
    """Redirect the process'"'"'s filtered stdout/stderr to both stream and file"""
    if proc.stderr is not None:
        # Asyncio does not work as the output processing can be executed in a
        # different thread.
        # selectors is possible to handle the multiplexing of stdout/stderr,
        # but it introduces buffering making the output not streaming.
        with multiprocessing.pool.ThreadPool(processes=1) as pool:
            err_args = copy.copy(args)
            err_args.line_processor = None
            stderr_fut = pool.apply_async(_handle_io_stream,
                                          args=(proc.stderr, sys.stderr,
                                                err_args))
            # Do not launch a thread for stdout as the rich.status does not
            # work in a thread, which is used in
            # log_utils.RayUpLineProcessor.
            stdout = _handle_io_stream(proc.stdout, sys.stdout, args)
            stderr = stderr_fut.get()
    else:
        stdout = _handle_io_stream(proc.stdout, sys.stdout, args)
        stderr = '"'"''"'"'
    return stdout, stderr

def run_with_log(
    cmd: Union[List[str], str],
    log_path: str,
    *,
    require_outputs: bool = False,
    stream_logs: bool = False,
    start_streaming_at: str = '"'"''"'"',
    end_streaming_at: Optional[str] = None,
    skip_lines: Optional[List[str]] = None,
    shell: bool = False,
    with_ray: bool = False,
    process_stream: bool = True,
    line_processor: Optional[log_utils.LineProcessor] = None,
    streaming_prefix: Optional[str] = None,
    **kwargs,
) -> Union[int, Tuple[int, str, str]]:
    """Runs a command and logs its output to a file.

    Args:
        cmd: The command to run.
        log_path: The path to the log file.
        stream_logs: Whether to stream the logs to stdout/stderr.
        require_outputs: Whether to return the stdout/stderr of the command.
        process_stream: Whether to post-process the stdout/stderr of the
            command, such as replacing or skipping lines on the fly. If
            enabled, lines are printed only when '"'"'\r'"'"' or '"'"'\n'"'"' is found.

    Returns the returncode or returncode, stdout and stderr of the command.
      Note that the stdout and stderr is already decoded.
    """
    assert process_stream or not require_outputs, (
        process_stream, require_outputs,
        '"'"'require_outputs should be False when process_stream is False'"'"')

    log_path = os.path.expanduser(log_path)
    dirname = os.path.dirname(log_path)
    os.makedirs(dirname, exist_ok=True)
    # Redirect stderr to stdout when using ray, to preserve the order of
    # stdout and stderr.
    stdout_arg = stderr_arg = None
    if process_stream:
        stdout_arg = subprocess.PIPE
        stderr_arg = subprocess.PIPE if not with_ray else subprocess.STDOUT
    with subprocess.Popen(cmd,
                          stdout=stdout_arg,
                          stderr=stderr_arg,
                          start_new_session=True,
                          shell=shell,
                          **kwargs) as proc:
        try:
            # The proc can be defunct if the python program is killed. Here we
            # open a new subprocess to gracefully kill the proc, SIGTERM
            # and then SIGKILL the process group.
            # Adapted from ray/dashboard/modules/job/job_manager.py#L154
            parent_pid = os.getpid()
            daemon_script = os.path.join(
                os.path.dirname(os.path.abspath(job_lib.__file__)),
                '"'"'subprocess_daemon.py'"'"')
            if not hasattr(constants, '"'"'SKY_GET_PYTHON_PATH_CMD'"'"'):
                # Backward compatibility: for cluster started before #3326, this
                # constant does not exist. Since we generate the job script
                # in backends.cloud_vm_ray_backend with inspect, so the
                # the lates `run_with_log` will be used, but the `constants` is
                # not updated. We fallback to `python3` in this case.
                # TODO(zhwu): remove this after 0.7.0.
                python_path = '"'"'python3'"'"'
            else:
                python_path = subprocess.check_output(
                    constants.SKY_GET_PYTHON_PATH_CMD,
                    shell=True,
                    stderr=subprocess.DEVNULL,
                    encoding='"'"'utf-8'"'"').strip()
            daemon_cmd = [
                python_path,
                daemon_script,
                '"'"'--parent-pid'"'"',
                str(parent_pid),
                '"'"'--proc-pid'"'"',
                str(proc.pid),
            ]

            subprocess.Popen(
                daemon_cmd,
                start_new_session=True,
                # Suppress output
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                # Disable input
                stdin=subprocess.DEVNULL,
            )
            stdout = '"'"''"'"'
            stderr = '"'"''"'"'

            if process_stream:
                if skip_lines is None:
                    skip_lines = []
                # Skip these lines caused by `-i` option of bash. Failed to
                # find other way to turn off these two warning.
                # https://stackoverflow.com/questions/13300764/how-to-tell-bash-not-to-issue-warnings-cannot-set-terminal-process-group-and # pylint: disable=line-too-long
                # `ssh -T -i -tt` still cause the problem.
                skip_lines += [
                    '"'"'bash: cannot set terminal process group'"'"',
                    '"'"'bash: no job control in this shell'"'"',
                ]
                # We need this even if the log_path is '"'"'/dev/null'"'"' to ensure the
                # progress bar is shown.
                # NOTE: Lines are printed only when '"'"'\r'"'"' or '"'"'\n'"'"' is found.
                args = _ProcessingArgs(
                    log_path=log_path,
                    stream_logs=stream_logs,
                    start_streaming_at=start_streaming_at,
                    end_streaming_at=end_streaming_at,
                    skip_lines=skip_lines,
                    line_processor=line_processor,
                    # Replace CRLF when the output is logged to driver by ray.
                    replace_crlf=with_ray,
                    streaming_prefix=streaming_prefix,
                )
                stdout, stderr = process_subprocess_stream(proc, args)
            proc.wait()
            if require_outputs:
                return proc.returncode, stdout, stderr
            return proc.returncode
        except KeyboardInterrupt:
            # Kill the subprocess directly, otherwise, the underlying
            # process will only be killed after the python program exits,
            # causing the stream handling stuck at `readline`.
            subprocess_utils.kill_children_processes()
            raise

def make_task_bash_script(codegen: str,
                          env_vars: Optional[Dict[str, str]] = None) -> str:
    # set -a is used for exporting all variables functions to the environment
    # so that bash `user_script` can access `conda activate`. Detail: #436.
    # Reference: https://www.gnu.org/software/bash/manual/html_node/The-Set-Builtin.html # pylint: disable=line-too-long
    # DEACTIVATE_SKY_REMOTE_PYTHON_ENV: Deactivate the SkyPilot runtime env, as
    # the ray cluster is started within the runtime env, which may cause the
    # user program to run in that env as well.
    # PYTHONUNBUFFERED is used to disable python output buffering.
    script = [
        textwrap.dedent(f"""\
            #!/bin/bash
            source ~/.bashrc
            set -a
            . $(conda info --base 2> /dev/null)/etc/profile.d/conda.sh > /dev/null 2>&1 || true
            set +a
            {constants.DEACTIVATE_SKY_REMOTE_PYTHON_ENV}
            export PYTHONUNBUFFERED=1
            cd {constants.SKY_REMOTE_WORKDIR}"""),
    ]
    if env_vars is not None:
        for k, v in env_vars.items():
            script.append(f'"'"'export {k}={shlex.quote(str(v))}'"'"')
    script += [
        codegen,
        '"'"''"'"',  # New line at EOF.
    ]
    script = '"'"'\n'"'"'.join(script)
    return script

def add_ray_env_vars(
        env_vars: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    # Adds Ray-related environment variables.
    if env_vars is None:
        env_vars = {}
    ray_env_vars = [
        '"'"'CUDA_VISIBLE_DEVICES'"'"', '"'"'RAY_CLIENT_MODE'"'"', '"'"'RAY_JOB_ID'"'"',
        '"'"'RAY_RAYLET_PID'"'"', '"'"'OMP_NUM_THREADS'"'"'
    ]
    env_dict = dict(os.environ)
    for env_var in ray_env_vars:
        if env_var in env_dict:
            env_vars[env_var] = env_dict[env_var]
    return env_vars

def run_bash_command_with_log(bash_command: str,
                              log_path: str,
                              env_vars: Optional[Dict[str, str]] = None,
                              stream_logs: bool = False,
                              with_ray: bool = False):
    with tempfile.NamedTemporaryFile('"'"'w'"'"', prefix='"'"'sky_app_'"'"',
                                     delete=False) as fp:
        bash_command = make_task_bash_script(bash_command, env_vars=env_vars)
        fp.write(bash_command)
        fp.flush()
        script_path = fp.name

        # Need this `-i` option to make sure `source ~/.bashrc` work.
        inner_command = f'"'"'/bin/bash -i {script_path}'"'"'

        subprocess_cmd: Union[str, List[str]]
        subprocess_cmd = inner_command

        return run_with_log(
            subprocess_cmd,
            log_path,
            stream_logs=stream_logs,
            with_ray=with_ray,
            # Disable input to avoid blocking.
            stdin=subprocess.DEVNULL,
            shell=True)

run_bash_command_with_log = ray.remote(run_bash_command_with_log)
if hasattr(autostop_lib, '"'"'set_last_active_time_to_now'"'"'):
    autostop_lib.set_last_active_time_to_now()

job_lib.set_status(2, job_lib.JobStatus.PENDING)
pg = ray_util.placement_group([{"CPU": 0.5}], '"'"'STRICT_SPREAD'"'"')
plural = '"'"'s'"'"' if 1 > 1 else '"'"''"'"'
node_str = f'"'"'1 node{plural}'"'"'

message = '"'"'INFO: Tip: use Ctrl-C to exit log streaming (task will not be killed).'"'"' + '"'"'\n'"'"'
message += f'"'"'INFO: Waiting for task resources on {node_str}. This will block if the cluster is full.'"'"'
print(message,
      flush=True)
# FIXME: This will print the error message from autoscaler if
# it is waiting for other task to finish. We should hide the
# error message.
ray.get(pg.ready())
print('"'"'INFO: All task resources reserved.'"'"',
      flush=True)

job_lib.set_job_started(2)
job_lib.scheduler.schedule_step()
@ray.remote
def check_ip():
    return ray.util.get_node_ip_address()
gang_scheduling_id_to_ip = ray.get([
    check_ip.options(
            num_cpus=0.5,
            scheduling_strategy=ray.util.scheduling_strategies.PlacementGroupSchedulingStrategy(
                placement_group=pg,
                placement_group_bundle_index=i
            )).remote()
    for i in range(pg.bundle_count)
])
print('"'"'INFO: Reserved IPs:'"'"', gang_scheduling_id_to_ip)

cluster_ips_to_node_id = {ip: i for i, ip in enumerate(['"'"'10.128.0.4'"'"'])}
job_ip_rank_list = sorted(gang_scheduling_id_to_ip, key=cluster_ips_to_node_id.get)
job_ip_rank_map = {ip: i for i, ip in enumerate(job_ip_rank_list)}
job_ip_list_str = '"'"'\n'"'"'.join(job_ip_rank_list)

sky_env_vars_dict = {}
sky_env_vars_dict['"'"'SKYPILOT_NODE_IPS'"'"'] = job_ip_list_str
# Backward compatibility: Environment starting with `SKY_` is
# deprecated. Remove it in v0.9.0.
sky_env_vars_dict['"'"'SKY_NODE_IPS'"'"'] = job_ip_list_str
sky_env_vars_dict['"'"'SKYPILOT_NUM_NODES'"'"'] = len(job_ip_rank_list)

sky_env_vars_dict['"'"'SKYPILOT_TASK_ID'"'"'] = '"'"'sky-2024-08-31-21-01-58-553026_mycluster_2'"'"'
sky_env_vars_dict['"'"'SKYPILOT_CLUSTER_INFO'"'"'] = '"'"'{"cluster_name": "mycluster", "cloud": "GCP", "region": "us-central1", "zone": "us-central1-a"}'"'"'
script = '"'"'echo "Hello, SkyPilot!"\nconda env list\n'"'"'
if run_fn is not None:
    script = run_fn(0, gang_scheduling_id_to_ip)


if script is not None:
    sky_env_vars_dict['"'"'SKYPILOT_NUM_GPUS_PER_NODE'"'"'] = 0
    # Backward compatibility: Environment starting with `SKY_` is
    # deprecated. Remove it in v0.9.0.
    sky_env_vars_dict['"'"'SKY_NUM_GPUS_PER_NODE'"'"'] = 0

    ip = gang_scheduling_id_to_ip[0]
    rank = job_ip_rank_map[ip]

    if len(cluster_ips_to_node_id) == 1: # Single-node task on single-node cluter
        name_str = '"'"'None,'"'"' if None != None else '"'"'task,'"'"'
        log_path = os.path.expanduser(os.path.join('"'"'~/sky_logs/sky-2024-08-31-21-01-58-553026/tasks'"'"', '"'"'run.log'"'"'))
    else: # Single-node or multi-node task on multi-node cluster
        idx_in_cluster = cluster_ips_to_node_id[ip]
        if cluster_ips_to_node_id[ip] == 0:
            node_name = '"'"'head'"'"'
        else:
            node_name = f'"'"'worker{idx_in_cluster}'"'"'
        name_str = f'"'"'{node_name}, rank={rank},'"'"'
        log_path = os.path.expanduser(os.path.join('"'"'~/sky_logs/sky-2024-08-31-21-01-58-553026/tasks'"'"', f'"'"'{rank}-{node_name}.log'"'"'))
    sky_env_vars_dict['"'"'SKYPILOT_NODE_RANK'"'"'] = rank
    # Backward compatibility: Environment starting with `SKY_` is
    # deprecated. Remove it in v0.9.0.
    sky_env_vars_dict['"'"'SKY_NODE_RANK'"'"'] = rank

    sky_env_vars_dict['"'"'SKYPILOT_INTERNAL_JOB_ID'"'"'] = 2
    # Backward compatibility: Environment starting with `SKY_` is
    # deprecated. Remove it in v0.9.0.
    sky_env_vars_dict['"'"'SKY_INTERNAL_JOB_ID'"'"'] = 2

    futures.append(run_bash_command_with_log \
            .options(name=name_str, num_cpus=0.5, scheduling_strategy=ray.util.scheduling_strategies.PlacementGroupSchedulingStrategy(placement_group=pg, placement_group_bundle_index=0)) \
            .remote(
                script,
                log_path,
                env_vars=sky_env_vars_dict,
                stream_logs=True,
                with_ray=True,
            ))
returncodes = get_or_fail(futures, pg)
if sum(returncodes) != 0:
    job_lib.set_status(2, job_lib.JobStatus.FAILED)
    # Schedule the next pending job immediately to make the job
    # scheduling more efficient.
    job_lib.scheduler.schedule_step()
    # This waits for all streaming logs to finish.
    time.sleep(0.5)
    reason = '"'"''"'"'
    # 139 is the return code of SIGSEGV, i.e. Segmentation Fault.
    if any(r == 139 for r in returncodes):
        reason = '"'"'(likely due to Segmentation Fault)'"'"'
    print('"'"'ERROR: Job 2 failed with '"'"'
          '"'"'return code list:'"'"',
          returncodes,
          reason,
          flush=True)
    # Need this to set the job status in ray job to be FAILED.
    sys.exit(1)
else:
    job_lib.set_status(2, job_lib.JobStatus.SUCCEEDED)
    # Schedule the next pending job immediately to make the job
    # scheduling more efficient.
    job_lib.scheduler.schedule_step()
    # This waits for all streaming logs to finish.
    time.sleep(0.5)
' > ~/.sky/sky_app/sky_job_2; } && $([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) -u -c 'import os;import getpass;from sky.skylet import job_lib, log_lib, constants;job_owner_kwargs = {} if getattr(constants, "SKYLET_LIB_VERSION", 0) >= 1 else {"job_owner": getpass.getuser()};job_lib.scheduler.queue(2,'"'"'RAY_DASHBOARD_PORT=$($([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) -c "from sky.skylet import job_lib; print(job_lib.get_job_submission_port())" 2> /dev/null || echo 8265);cd ~/sky_workdir && $([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) $([ -s ~/.sky/ray_path ] && cat ~/.sky/ray_path 2> /dev/null || which ray) job submit --address=http://127.0.0.1:$RAY_DASHBOARD_PORT --submission-id 2-$(whoami) --no-wait "$([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) -u ~/.sky/sky_app/sky_job_2 > ~/sky_logs/sky-2024-08-31-21-01-58-553026/run.log 2> /dev/null"'"'"')' failed with return code 255.
Failed to submit job 2.

Version & Commit info:
$ sky -v
skypilot, version 0.6.1
$ sky -c
skypilot, commit bc30c0b
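
For anyone hitting the same `mm_send_fd: sendmsg(2): Message too long` / `mux_client_request_session: send fds failed` lines: these come from OpenSSH's connection-multiplexing (ControlMaster) client, and in this report they appear to be triggered by the very long inline command that SkyPilot sends over the multiplexed connection. As a diagnostic only (not the actual fix, which landed in the linked PR), one could force a direct, non-multiplexed connection by disabling ControlMaster for the cluster's SSH host alias. A hypothetical `~/.ssh/config` fragment, assuming the alias is `mycluster`:

```
# ~/.ssh/config — diagnostic sketch, not the real fix.
# `mycluster` is assumed to be the SSH host alias for the cluster.
Host mycluster
    ControlMaster no
    ControlPath none
```

With multiplexing disabled, each `ssh mycluster` opens a fresh connection instead of going through the mux socket, which can help confirm whether the failure is specific to the multiplexing path.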

@Michaelvll
Collaborator

Thanks for raising the issue @kesitrifork! This issue should be fixed by #3884. Could you try that PR and see if it fixes your issue?

@Michaelvll
Collaborator

#3884 has been merged. I am closing this issue for now. If the problem persists, please feel free to re-open the issue.

@kesitrifork
Author

@Michaelvll I tried installing from source, and now it fails more gracefully and shows an error message:

E 09-03 19:21:22 subprocess_utils.py:84] mm_send_fd: sendmsg(2): Message too long
E 09-03 19:21:22 subprocess_utils.py:84] mux_client_request_session: send fds failed

Full log:

$ sky launch -c mycluster hello_sky.yaml
Task from YAML spec: hello_sky.yaml
Running task on cluster mycluster...
I 09-03 19:20:31 cloud_vm_ray_backend.py:1314] To view detailed progress: tail -n100 -f /Users/kevinsimper/sky_logs/sky-2024-09-03-19-20-31-230528/provision.log
I 09-03 19:20:33 provisioner.py:65] Launching on GCP us-central1 (us-central1-a)
I 09-03 19:20:50 provisioner.py:450] Successfully provisioned or found existing instance.
I 09-03 19:21:15 provisioner.py:552] Successfully provisioned cluster: mycluster
I 09-03 19:21:15 cloud_vm_ray_backend.py:3064] Syncing workdir (to 1 node): . -> ~/sky_workdir
I 09-03 19:21:15 cloud_vm_ray_backend.py:3072] To view detailed progress: tail -n100 -f ~/sky_logs/sky-2024-09-03-19-20-31-230528/workdir_sync.log
I 09-03 19:21:15 cloud_vm_ray_backend.py:3202] Running setup on 1 node.
bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
bash: no job control in this shell
bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
Running setup.
I 09-03 19:21:19 cloud_vm_ray_backend.py:3215] Setup completed.
E 09-03 19:21:22 subprocess_utils.py:84] mm_send_fd: sendmsg(2): Message too long
E 09-03 19:21:22 subprocess_utils.py:84] mux_client_request_session: send fds failed
E 09-03 19:21:22 subprocess_utils.py:84]
I 09-03 19:21:22 cloud_vm_ray_backend.py:3447]
I 09-03 19:21:22 cloud_vm_ray_backend.py:3447] Cluster name: mycluster
I 09-03 19:21:22 cloud_vm_ray_backend.py:3447] To log into the head VM:	ssh mycluster
I 09-03 19:21:22 cloud_vm_ray_backend.py:3447] To submit a job:		sky exec mycluster yaml_file
I 09-03 19:21:22 cloud_vm_ray_backend.py:3447] To stop the cluster:	sky stop mycluster
I 09-03 19:21:22 cloud_vm_ray_backend.py:3447] To teardown the cluster:	sky down mycluster
Clusters
NAME       LAUNCHED        RESOURCES              STATUS  AUTOSTOP  COMMAND
mycluster  a few secs ago  1x GCP(n2-standard-8)  UP      -         sky launch -c mycluster h...

sky.exceptions.CommandError: Command cd ~/sky_workdir && mkdir -p ~/sky_logs/sky-2024-09-03-19-20-31-230528 && touch ~/sky_logs/sky-2024-... failed with return code 255.
Failed to submit job 1.

@Michaelvll Michaelvll reopened this Sep 3, 2024
@Michaelvll
Collaborator

Michaelvll commented Sep 3, 2024

Oops, sorry for the issue @kesitrifork! I could not test it on my end, so we missed a line in the previous fix. Could you help test it out on #3906? : )

@kesitrifork
Author

@Michaelvll The quick start now works on GCP again, hooray! 🥳 Thank you for the help, I appreciate it!

@ckgresla
Contributor

ckgresla commented Sep 9, 2024

If anyone else stumbles upon this in the interim, building the repo from source (with the merged changes) worked on my end:

git clone [email protected]:skypilot-org/skypilot.git
pip install -e '.[gcp]'
# + whatever other cloud providers you'd like
pip show skypilot  # verify you have a local install; it should show a version like 1.0.0.dev0

sky exec <your-cluster> <your-yaml>

@Michaelvll
Collaborator


Thanks for sending the guide @ckgresla! You could also install from our nightly build if that is easier:

pip uninstall skypilot
pip install "skypilot-nightly[gcp]"

@ckgresla
Contributor

ckgresla commented Sep 9, 2024

Ah great point, thank you @Michaelvll! 🙇

@Michaelvll Michaelvll added the OSS label Dec 19, 2024 — with Linear