
Add containers for Toil appliance (resolves #159) #160

Closed

Conversation

@fnothaft (Contributor) commented Aug 2, 2016

First pass at putting together Toil appliance containers for DataBiosphere/toil#1088 slash #159. CC @cket. Still needs some TLC:

  • Need to come up with a set of tests that we can run (make test does nothing)
  • I need to test building this
  • We probably want this to point at unstable Toil releases for now; currently it's pinned to 3.3.0 (see the sketch below).
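
For context, that pin is just the install step inside the appliance image. A minimal sketch of what it might look like (the package extras and exact commands are assumptions; the actual Dockerfile isn't shown in this thread):

# Hypothetical install step from the appliance image build (extras assumed).
pip install "toil[aws,mesos]==3.3.0"
# Pointing at unstable releases would instead install from the Toil git
# repository (e.g. a development branch) rather than a pinned PyPI version.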

@fnothaft (Contributor, Author) commented Aug 3, 2016

TODOs:

  • @cket to PR with small fixes against this branch
  • @fnothaft to circle with @briandoconnor about single vs. multiple docker images
  • @cket to test on cluster
  • @cket to post error message (read only file system)
  • @fnothaft to look at read only file system error
  • @cket to write unit test --> launch leader/worker containers, run small toil script
  • @fnothaft determine if/when we should implement user data for Mesos master discovery
  • @cket to write provisioner script to launch toil leader instance
  • @cket to ensure Mesos indicates whether the node is preemptible or not (see the sketch below)
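
On Mesos master discovery and on surfacing preemptibility, one possible shape for the worker entry point, purely as a sketch (the metadata endpoint, attribute name, and paths are assumptions, not decisions from this thread):

#!/bin/bash
# Illustrative worker entry-point sketch, not part of this PR.
# Read the Mesos master address from EC2 user data and advertise whether this
# node is preemptible to frameworks as a Mesos slave attribute.
MASTER=$(curl -s http://169.254.169.254/latest/user-data)   # e.g. "10.0.0.5:5050"
PREEMPTIBLE=${PREEMPTIBLE:-false}

exec mesos-slave \
    --master="$MASTER" \
    --work_dir=/var/lib/mesos \
    --attributes="preemptible:$PREEMPTIBLE"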

@cket (Contributor) commented Aug 3, 2016

Error message thrown when launching container with mesos-slave as the entry point:

14:11:09 toil-appliance $ docker run -it d67def2f0ef5 --master=172.17.0.2:5050 --work_dir=/tmp/
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0803 21:11:15.851196 1 main.cpp:243] Build: 2016-07-27 20:23:20 by ubuntu
I0803 21:11:15.852200 1 main.cpp:244] Version: 1.0.0
I0803 21:11:15.852464 1 main.cpp:247] Git tag: 1.0.0
I0803 21:11:15.852957 1 main.cpp:251] Git SHA: c9b70582e9fccab8f6863b0bd3a812b5969a8c24
I0803 21:11:15.858079 1 containerizer.cpp:196] Using isolation: posix/cpu,posix/mem,filesystem/posix,network/cni
Failed to create a containerizer: Could not create MesosContainerizer: Failed to create launcher: Failed to create Linux launcher: Failed to create root cgroup /sys/fs/cgroup/freezer/mesos: Failed to create directory '/sys/fs/cgroup/freezer/mesos': Read-only file system

@fnothaft (Contributor, Author) commented Aug 3, 2016

@cket looks like this is related to MESOS-3498. I will look into it more.

@fnothaft (Contributor, Author) commented Aug 3, 2016

@cket it appears a fix may be to export MESOS_LAUNCHER=posix, as per MESOS-3793. We should add this to the Dockerfile.
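
For anyone hitting the same failure, the workaround amounts to setting that variable in the container's environment, e.g. by mirroring the command from the log above (Mesos picks up MESOS_-prefixed environment variables as flags, so this is equivalent to adding ENV MESOS_LAUNCHER=posix to the Dockerfile):

# Force the POSIX launcher so mesos-slave doesn't try to create cgroups on a
# read-only filesystem; image ID and arguments are the ones from the log above.
docker run -it -e MESOS_LAUNCHER=posix d67def2f0ef5 --master=172.17.0.2:5050 --work_dir=/tmp/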

@cket (Contributor) commented Aug 3, 2016

@fnothaft that fix seems to work; I included it in the PR I just filed.

@fnothaft (Contributor, Author) commented Aug 3, 2016

> @fnothaft that fix seems to work; I included it in the PR I just filed.

Yahtzee! I'll merge that in a sec.

@cket (Contributor) commented Aug 24, 2016

The autoscaling code will be responsible for provisioning worker nodes, but we need a separate leader-provisioning script to start the Toil leader on EC2. Since Toil will run on the leader node, this provisioning can't be done as part of the workflow. Instead, @hannes-ucsc and I propose that this script be included as part of Toil with the AWS extra; it will spin up an instance, propagate the AWS credentials, and start the toil-leader container.

If the user doesn't wish to pip install Toil, they can alternatively run the script from within the Toil appliance to launch the leader.
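
Roughly, the AWS flavor of that script would boil down to something like the following (a sketch only; every identifier is a placeholder, and the real script would likely use boto rather than the AWS CLI):

# Illustrative only: launch the leader instance and have it start the
# toil-leader container on first boot via user data.
aws ec2 run-instances \
    --image-id ami-00000000 \
    --instance-type m3.large \
    --key-name my-key \
    --iam-instance-profile Name=toil-leader-profile \
    --user-data file://start-leader.sh

# where start-leader.sh would contain roughly:
#   #!/bin/bash
#   docker run --net=host <toil-leader-image>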

@fnothaft (Contributor, Author) commented:

+1 to that approach

@cket (Contributor) commented Aug 25, 2016

@hannes-ucsc and I also discussed using IAM roles for the worker nodes and how to ensure that the block-device mapping is done properly for instances with ephemeral volumes. Unfortunately, the AMI we are using doesn't mount any volumes for us and will attach at most one volume if present; for instances with more than one block device, attaching and RAIDing the devices is left up to us.

Block-device mapping will be handled by the leader prior to launching the instances, and I wrote a script, to be included in the user data, that will discover the devices, RAID them, and mount them appropriately (see the sketch below).
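
The script itself isn't shown in this thread, but the general shape of such a user-data step is roughly the following (device naming, RAID level, filesystem, and mount point are all assumptions):

#!/bin/bash
# Illustrative sketch: discover the ephemeral block devices, stripe them into
# a single RAID 0 array if there is more than one, then format and mount.
set -e
devices=$(ls /dev/xvd[b-z] 2>/dev/null || true)
count=$(echo "$devices" | wc -w)
if [ "$count" -eq 0 ]; then
    exit 0                       # no ephemeral devices on this instance type
elif [ "$count" -eq 1 ]; then
    target="$devices"            # single device: use it directly
else
    target=/dev/md0              # multiple devices: assemble into one array
    mdadm --create "$target" --run --level=0 --raid-devices="$count" $devices
fi
mkfs.ext4 "$target"
mkdir -p /mnt/ephemeral
mount "$target" /mnt/ephemeral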

@hannes-ucsc (Contributor) commented:

Another thing that we tentatively settled on is using CoreOS. It provides VM images with Docker for AWS, Azure and Google Compute.

We call the program that is used to launch the leader the "bootstrapper". I propose that we have distinct bootstrappers for each cloud, e.g. toil-azure, toil-google and toil-aws, each one a separate entry point. As CJ mentioned, the bootstrappers should be part of Toil. Since launching a leader VM is very similar to launching a worker VM, the bootstrapping code should live in the provisioner. We'll add a createLeader method to the provisioner API.

@cket, with this design we can use IAM roles for the leader, too. No need to copy credentials. CGCloud has code for most of this already, so the strategy should be to move sharable functionality from cgcloud-core/src/…/box.py to cgcloud-lib/src/…/ec2.py and reuse that in Toil. This approach was useful for me with the initial provisioner implementation. cgcloud-lib has no exotic dependencies, so there is no detriment to Toil depending on it.

@hannes-ucsc (Contributor) commented:

+1 from me. There's a bunch of CRs missing at EOF. This is on the critical path towards the next Toil release. @cket, I propose that you hijack this PR (close this one, continue the branch with your own changes, and open a new PR with the result) instead of asking @fnothaft to merge your PR against his. Make sure you don't squash commits from different authors so that authorship is retained in the history. Would that be OK with you, @fnothaft? Any other process that gets this merged tomorrow (Wednesday) would be fine with me.

@fnothaft (Contributor, Author) commented Sep 7, 2016

OK by me.

> Make sure you don't squash commits from different authors so that authorship is retained in the history.

Squashing is OK too.

@cket (Contributor) commented Sep 7, 2016

Continued in #176.
