Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize docker initialization #3823

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

camelop
Copy link

@camelop camelop commented Aug 9, 2024

Reduce ssh interaction round number on the happy path from 14 to 5 for docker initialization.
Related issue: #3661

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below) - No
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_docker_preinstalled_package
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for submitting this PR @camelop! This is awesome. Left several comments. : )

sky/provision/docker_utils.py Show resolved Hide resolved
if [[ $docker_installed -eq 1 ]]
then
# wait for docker daemon to be ready
docker_ready=0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about renaming this to docker_socket_ready?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, renamed.


# SkyPilot: Docker login if user specified a private docker registry.
cmd_login = ''
if 'docker_login_config' in self.docker_config:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we combine the following into the same call as the _check_host_status?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I am wondering as we may get a longer commands to run, will it be more readable if we explicitly create a enum as a state machine to indicate the stage we are in, and we can do post processing based on the state.

Copy link
Author

@camelop camelop Sep 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's combined to the next "action" call.
Currently it is following the structure of

  1. remote read-only status checking with sh
  2. local operation decision making in Python
  3. remote action

To further save SSH calls, I would recommend completely refactor the execution structure into the following:

  1. gather all local Python execution demands and render an execution script with it (e.g. with templating from Jinja2, similar to https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_templating.html ). The script should potentially execute different branches depending on status on the target machine.
  2. once the script is rendered ready, send it to remote for execution
  3. gather the result in one round to summarize whether the change has been successfully replied

I would expect such change to be a global coding style / design pattern change, so not sure if it is a good fit for a minor opt PR like this.

Comment on lines +255 to +261
if 'container_status' in status:
if 'Exited' in status['container_status']:
logger.info('Container is exited but not removed.')
self.initialized = True
self._run(f'{self.docker_cmd} start {self.container_name}')
self._run('sudo service ssh start', run_env='docker')
return self._run('whoami', run_env='docker')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be integrated into the check_host_status command as well?

Copy link
Author

@camelop camelop Sep 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is not on the happy path though, but also follow the above design pattern (the action part in step 3).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that we may avoid having it in _check_host_status. This code path could happen if sky stop cluster-name and sky start cluster-name are called. We should probably merge these three continuous self._run into a single call?

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh, I just realized that I have a few pending review comments forgot to sent from a while ago. Please find the comments below. Thanks for making the changes @camelop!!

Comment on lines +255 to +261
if 'container_status' in status:
if 'Exited' in status['container_status']:
logger.info('Container is exited but not removed.')
self.initialized = True
self._run(f'{self.docker_cmd} start {self.container_name}')
self._run('sudo service ssh start', run_env='docker')
return self._run('whoami', run_env='docker')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that we may avoid having it in _check_host_status. This code path could happen if sky stop cluster-name and sky start cluster-name are called. We should probably merge these three continuous self._run into a single call?

Comment on lines +274 to +277
cmd_login = f'{self.docker_cmd} login --username '\
f'{docker_login_config.username} '\
f'--password {docker_login_config.password} '\
f'{docker_login_config.server};'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style nit: avoid using backslash

Suggested change
cmd_login = f'{self.docker_cmd} login --username '\
f'{docker_login_config.username} '\
f'--password {docker_login_config.password} '\
f'{docker_login_config.server};'
cmd_login = (f'{self.docker_cmd} login --username '
f'{docker_login_config.username} '
f'--password {docker_login_config.password} '
f'{docker_login_config.server};')

Comment on lines +289 to +291
cmd_pull = f'{self.docker_cmd} image inspect {specific_image} '\
'1> /dev/null 2>&1 || '\
f'{self.docker_cmd} pull {specific_image};'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style nit: avoid using backslash

Suggested change
cmd_pull = f'{self.docker_cmd} image inspect {specific_image} '\
'1> /dev/null 2>&1 || '\
f'{self.docker_cmd} pull {specific_image};'
cmd_pull = (f'{self.docker_cmd} image inspect {specific_image} '
'1> /dev/null 2>&1 || '
f'{self.docker_cmd} pull {specific_image};')

container_running = self._check_container_status()

cmd_run = ''
container_running = 'Up' in status.get('container_status', False)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this mean by 'Up' in False mean here? Should we change the default value to a string?

Comment on lines +310 to +316
cmd_run ='[ -f /etc/docker/daemon.json ] || '\
'echo "{}" | sudo tee /etc/docker/daemon.json;'\
'sudo jq \'.["exec-opts"] ='\
'["native.cgroupdriver=cgroupfs"]\' '\
'/etc/docker/daemon.json > /tmp/daemon.json;'\
'sudo mv /tmp/daemon.json /etc/docker/daemon.json;'\
'sudo systemctl restart docker;'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style nit: avoid using backslash

Comment on lines +337 to +342
cmd_copy = f'{self.docker_cmd} cp ~/.ssh/authorized_keys '\
f'{container_name}:/tmp/host_ssh_authorized_keys;'\
'sudo systemctl stop jupyter > /dev/null 2>&1 || true;'\
'sudo systemctl disable jupyter > /dev/null 2>&1 || true;'\
'sudo systemctl stop jupyterhub > /dev/null 2>&1 || true;'\
'sudo systemctl disable jupyterhub > /dev/null 2>&1 || true;'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: backslash

Comment on lines +358 to +365
cmd_d_shell = f'echo \'{command_runner.ALIAS_SUDO_TO_EMPTY_FOR_ROOT_CMD}\' '\
'>> ~/.bashrc;'\
'echo "export DEBIAN_FRONTEND=noninteractive" >> ~/.bashrc;'\
'source ~/.bashrc;'
# Install dependencies.
self._run(
'sudo apt-get update; '
# Our mount script will install gcsfuse without fuse package.
# We need to install fuse package first to enable storage mount.
# The dpkg option is to suppress the prompt for fuse installation.
'sudo apt-get -o DPkg::Options::="--force-confnew" install -y '
'rsync curl wget patch openssh-server python3-pip fuse;',
run_env='docker')

# Copy local authorized_keys to docker container.
# Stop and disable jupyter service. This is to avoid port conflict on
# 8080 if we use default deep learning image in GCP, and 8888 if we use
# default deep learning image in Azure.
# Azure also has a jupyterhub service running on 8081, so we stop and
# disable that too.
container_name = constants.DEFAULT_DOCKER_CONTAINER_NAME
self._run(
f'rsync -e "{self.docker_cmd} exec -i" -avz ~/.ssh/authorized_keys '
f'{container_name}:/tmp/host_ssh_authorized_keys;'
'sudo systemctl stop jupyter > /dev/null 2>&1 || true;'
'sudo systemctl disable jupyter > /dev/null 2>&1 || true;'
'sudo systemctl stop jupyterhub > /dev/null 2>&1 || true;'
'sudo systemctl disable jupyterhub > /dev/null 2>&1 || true;',
run_env='host')
cmd_d_dep = 'sudo apt-get update; '\
'sudo apt-get -o DPkg::Options::="--force-confnew" install -y '\
'rsync curl wget patch openssh-server python3-pip fuse;'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: backslash

Comment on lines +379 to +384
cmd_d_port = f'sudo sed -i "s/#Port 22/Port {port}/" /etc/ssh/sshd_config;'\
'mkdir -p ~/.ssh;'\
'cat /tmp/host_ssh_authorized_keys >> ~/.ssh/authorized_keys;'\
'sudo service ssh start;'\
'sudo sed -i "s/mesg n/tty -s \&\& mesg n/" ~/.profile;'\
f'{SETUP_ENV_VARS_CMD};'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: backslash

def _configure_runtime(self, run_options: List[str]) -> List[str]:
def _configure_runtime(self,
run_options: List[str],
runtime_output=None) -> List[str]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
runtime_output=None) -> List[str]:
runtime_output: Optional[str]=None) -> List[str]:

def _auto_configure_shm(self, run_options: List[str]) -> List[str]:
def _auto_configure_shm(self,
run_options: List[str],
available_memory=None) -> List[str]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
available_memory=None) -> List[str]:
available_memory: Optional[int]=None) -> List[str]:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants