-
Notifications
You must be signed in to change notification settings - Fork 547
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize docker initialization #3823
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for submitting this PR @camelop! This is awesome. Left several comments. : )
sky/provision/docker_utils.py
Outdated
if [[ $docker_installed -eq 1 ]] | ||
then | ||
# wait for docker daemon to be ready | ||
docker_ready=0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about renaming this to docker_socket_ready
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, renamed.
|
||
# SkyPilot: Docker login if user specified a private docker registry. | ||
cmd_login = '' | ||
if 'docker_login_config' in self.docker_config: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we combine the following into the same call as the _check_host_status
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I am wondering as we may get a longer commands to run, will it be more readable if we explicitly create a enum as a state machine to indicate the stage we are in, and we can do post processing based on the state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's combined to the next "action" call.
Currently it is following the structure of
- remote read-only status checking with sh
- local operation decision making in Python
- remote action
To further save SSH calls, I would recommend completely refactor the execution structure into the following:
- gather all local Python execution demands and render an execution script with it (e.g. with templating from Jinja2, similar to https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_templating.html ). The script should potentially execute different branches depending on status on the target machine.
- once the script is rendered ready, send it to remote for execution
- gather the result in one round to summarize whether the change has been successfully replied
I would expect such change to be a global coding style / design pattern change, so not sure if it is a good fit for a minor opt PR like this.
if 'container_status' in status: | ||
if 'Exited' in status['container_status']: | ||
logger.info('Container is exited but not removed.') | ||
self.initialized = True | ||
self._run(f'{self.docker_cmd} start {self.container_name}') | ||
self._run('sudo service ssh start', run_env='docker') | ||
return self._run('whoami', run_env='docker') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be integrated into the check_host_status
command as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one is not on the happy path though, but also follow the above design pattern (the action part in step 3).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree that we may avoid having it in _check_host_status
. This code path could happen if sky stop cluster-name
and sky start cluster-name
are called. We should probably merge these three continuous self._run
into a single call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahhh, I just realized that I have a few pending review comments forgot to sent from a while ago. Please find the comments below. Thanks for making the changes @camelop!!
if 'container_status' in status: | ||
if 'Exited' in status['container_status']: | ||
logger.info('Container is exited but not removed.') | ||
self.initialized = True | ||
self._run(f'{self.docker_cmd} start {self.container_name}') | ||
self._run('sudo service ssh start', run_env='docker') | ||
return self._run('whoami', run_env='docker') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree that we may avoid having it in _check_host_status
. This code path could happen if sky stop cluster-name
and sky start cluster-name
are called. We should probably merge these three continuous self._run
into a single call?
cmd_login = f'{self.docker_cmd} login --username '\ | ||
f'{docker_login_config.username} '\ | ||
f'--password {docker_login_config.password} '\ | ||
f'{docker_login_config.server};' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style nit: avoid using backslash
cmd_login = f'{self.docker_cmd} login --username '\ | |
f'{docker_login_config.username} '\ | |
f'--password {docker_login_config.password} '\ | |
f'{docker_login_config.server};' | |
cmd_login = (f'{self.docker_cmd} login --username ' | |
f'{docker_login_config.username} ' | |
f'--password {docker_login_config.password} ' | |
f'{docker_login_config.server};') |
cmd_pull = f'{self.docker_cmd} image inspect {specific_image} '\ | ||
'1> /dev/null 2>&1 || '\ | ||
f'{self.docker_cmd} pull {specific_image};' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style nit: avoid using backslash
cmd_pull = f'{self.docker_cmd} image inspect {specific_image} '\ | |
'1> /dev/null 2>&1 || '\ | |
f'{self.docker_cmd} pull {specific_image};' | |
cmd_pull = (f'{self.docker_cmd} image inspect {specific_image} ' | |
'1> /dev/null 2>&1 || ' | |
f'{self.docker_cmd} pull {specific_image};') |
container_running = self._check_container_status() | ||
|
||
cmd_run = '' | ||
container_running = 'Up' in status.get('container_status', False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this mean by 'Up' in False
mean here? Should we change the default value to a string?
cmd_run ='[ -f /etc/docker/daemon.json ] || '\ | ||
'echo "{}" | sudo tee /etc/docker/daemon.json;'\ | ||
'sudo jq \'.["exec-opts"] ='\ | ||
'["native.cgroupdriver=cgroupfs"]\' '\ | ||
'/etc/docker/daemon.json > /tmp/daemon.json;'\ | ||
'sudo mv /tmp/daemon.json /etc/docker/daemon.json;'\ | ||
'sudo systemctl restart docker;' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style nit: avoid using backslash
cmd_copy = f'{self.docker_cmd} cp ~/.ssh/authorized_keys '\ | ||
f'{container_name}:/tmp/host_ssh_authorized_keys;'\ | ||
'sudo systemctl stop jupyter > /dev/null 2>&1 || true;'\ | ||
'sudo systemctl disable jupyter > /dev/null 2>&1 || true;'\ | ||
'sudo systemctl stop jupyterhub > /dev/null 2>&1 || true;'\ | ||
'sudo systemctl disable jupyterhub > /dev/null 2>&1 || true;' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: backslash
cmd_d_shell = f'echo \'{command_runner.ALIAS_SUDO_TO_EMPTY_FOR_ROOT_CMD}\' '\ | ||
'>> ~/.bashrc;'\ | ||
'echo "export DEBIAN_FRONTEND=noninteractive" >> ~/.bashrc;'\ | ||
'source ~/.bashrc;' | ||
# Install dependencies. | ||
self._run( | ||
'sudo apt-get update; ' | ||
# Our mount script will install gcsfuse without fuse package. | ||
# We need to install fuse package first to enable storage mount. | ||
# The dpkg option is to suppress the prompt for fuse installation. | ||
'sudo apt-get -o DPkg::Options::="--force-confnew" install -y ' | ||
'rsync curl wget patch openssh-server python3-pip fuse;', | ||
run_env='docker') | ||
|
||
# Copy local authorized_keys to docker container. | ||
# Stop and disable jupyter service. This is to avoid port conflict on | ||
# 8080 if we use default deep learning image in GCP, and 8888 if we use | ||
# default deep learning image in Azure. | ||
# Azure also has a jupyterhub service running on 8081, so we stop and | ||
# disable that too. | ||
container_name = constants.DEFAULT_DOCKER_CONTAINER_NAME | ||
self._run( | ||
f'rsync -e "{self.docker_cmd} exec -i" -avz ~/.ssh/authorized_keys ' | ||
f'{container_name}:/tmp/host_ssh_authorized_keys;' | ||
'sudo systemctl stop jupyter > /dev/null 2>&1 || true;' | ||
'sudo systemctl disable jupyter > /dev/null 2>&1 || true;' | ||
'sudo systemctl stop jupyterhub > /dev/null 2>&1 || true;' | ||
'sudo systemctl disable jupyterhub > /dev/null 2>&1 || true;', | ||
run_env='host') | ||
cmd_d_dep = 'sudo apt-get update; '\ | ||
'sudo apt-get -o DPkg::Options::="--force-confnew" install -y '\ | ||
'rsync curl wget patch openssh-server python3-pip fuse;' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: backslash
cmd_d_port = f'sudo sed -i "s/#Port 22/Port {port}/" /etc/ssh/sshd_config;'\ | ||
'mkdir -p ~/.ssh;'\ | ||
'cat /tmp/host_ssh_authorized_keys >> ~/.ssh/authorized_keys;'\ | ||
'sudo service ssh start;'\ | ||
'sudo sed -i "s/mesg n/tty -s \&\& mesg n/" ~/.profile;'\ | ||
f'{SETUP_ENV_VARS_CMD};' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: backslash
def _configure_runtime(self, run_options: List[str]) -> List[str]: | ||
def _configure_runtime(self, | ||
run_options: List[str], | ||
runtime_output=None) -> List[str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
runtime_output=None) -> List[str]: | |
runtime_output: Optional[str]=None) -> List[str]: |
def _auto_configure_shm(self, run_options: List[str]) -> List[str]: | ||
def _auto_configure_shm(self, | ||
run_options: List[str], | ||
available_memory=None) -> List[str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
available_memory=None) -> List[str]: | |
available_memory: Optional[int]=None) -> List[str]: |
Reduce ssh interaction round number on the happy path from 14 to 5 for docker initialization.
Related issue: #3661
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_docker_preinstalled_package
conda deactivate; bash -i tests/backward_compatibility_tests.sh