
jeffhgs/weatherballoon


Weatherballoon

Weatherballoon: The fastest inexpensive way to get your new experiment into the clouds.

Image from a hobby high-altitude balloon. Image copyright Noah Klugman

Synopsis

Weatherballoon takes any run or test command that you would issue on your local command line and efficiently offloads its execution to cloud compute resources. It is especially well suited to high-end cloud computing resources, such as cloud instances equipped with specialized GPU/TPU hardware for training deep learning models.

Requirements

  • Run on high end cloud compute resources with confidence that you will not accidentally overspend
  • Automatically stop spending, limited only by the failure modes of the underlying cloud services
  • Easy installation: run easily on macOS, Windows, and Linux, using well-established abstraction layers. To that end, prefer embedding a dependency over depending on a local binary (e.g. openssh) that might pose portability problems or offer too many degrees of configuration freedom.
  • Pluggable cloud service provisioner backends
  • Heartbeat-based failure detection
  • Unified configuration
  • Be efficient with:
    • Incurred fees for cloud resources
    • Wall clock time
  • In the presence of transient errors (networking, provisioning), never run a job more than once, but otherwise retry to ensure a very high chance of a successful job attempt.
  • Interpret a job that runs and then decides to fail as a successful attempt
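
The last two requirements can be read as "retry the setup, never the job". The sketch below is one possible reading in Python, not Weatherballoon's actual implementation. The names `TransientError`, `prepare`, `run_job`, and `max_attempts` are hypothetical and chosen only for illustration.

```python
class TransientError(Exception):
    """Hypothetical marker for retryable failures (networking, provisioning)."""


def run_job_at_most_once(prepare, run_job, max_attempts=3):
    """Retry `prepare` on transient errors, then call `run_job` exactly once.

    Returns whatever `run_job` returns (here: the job's exit code). A nonzero
    exit code counts as a completed attempt, so it is returned, not retried.
    """
    last_error = None
    for _ in range(max_attempts):
        try:
            resource = prepare()  # e.g. provision an instance, sync files
            break
        except TransientError as err:
            last_error = err  # safe to retry: the job has not started yet
    else:
        # The loop never hit `break`, so every setup attempt failed.
        raise RuntimeError(
            "setup failed after %d attempts" % max_attempts
        ) from last_error
    return run_job(resource)  # invoked once; never inside the retry loop
```

Keeping the job call outside the retry loop is what makes the "never more than once" guarantee easy to see: only steps that happen before the job starts are ever repeated.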

Installation

Prerequisites

Current prerequisites for the local machine are:

  • JDK 1.8
  • One portable third-party binary (rclone, a Go-based cloud file synchronizer)
  • Credentials on one cloud compute service

Current prerequisites for the remote compute image are:

  • Ubuntu Linux is the target test OS. Other Linux-based distributions may work but are not currently tested.

Current supported cloud compute services are:

  • Amazon Elastic Compute Cloud (AWS EC2)

Installation Instructions

  • 1 - Unzip the distribution (currently 3 files) into a location on your PATH
  • 2 - Install rclone
  • 3 - Run weatherballoon.sh

Usage

At this time, all weatherballoon configuration is done using the configuration files.

Weatherballoon command line invocation is as follows:

weatherballoon.sh -- <command to run remotely>

If a remote compute instance is currently available, it will be used to run the command. If no remote compute instance is available, one will be created, and then used to run the command.
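
The reuse-or-create rule above amounts to a single decision. A minimal Python sketch follows, assuming two hypothetical callables, `find_available` and `create`, that stand in for whatever the provisioner backend actually does:

```python
def acquire_instance(find_available, create):
    """Return an available instance if there is one, otherwise create one.

    `find_available` returns an instance handle or None.
    `create` provisions a new instance and returns its handle.
    Both are hypothetical placeholders for the provisioner backend.
    """
    instance = find_available()
    if instance is None:
        instance = create()  # only pay for a new instance when none exists
    return instance
```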

Configuration

Specify your configuration with a file named .weatherballoon.json

Weatherballoon searches for the .weatherballoon.json file starting in the current working directory, and then in each of its parent directories in turn.
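
The upward search can be sketched in a few lines of Python. This is an illustration of the lookup order described above, not Weatherballoon's own code. The function name `find_config` is hypothetical, and stopping at the first match is an assumption.

```python
from pathlib import Path
from typing import Optional


def find_config(start: Path, name: str = ".weatherballoon.json") -> Optional[Path]:
    """Look for `name` in `start`, then in each parent directory of `start`.

    Returns the first match, or None if the file is not found anywhere
    between `start` and the filesystem root.
    """
    start = start.resolve()
    for directory in [start, *start.parents]:
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None
```

Calling `find_config(Path.cwd())` from a project subdirectory would pick up a configuration file stored at the project root, so one file can serve a whole source tree.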

An example .weatherballoon.json file is located in doc/sample_.weatherballoon.json:

{
  "provisioner":{
    "kind": "aws",
    "region": "us-west-2",
    "group1": "sg_temp1",
    "keyPair": "id_gs_temp_2019-01",
    "os": {
      "ami": "ami-0e3e4660d8725dd31",
      "username": "ubuntu"
    },
    "instanceType": "t2.medium",
    "cred": null,
    "gbsizeOfMainDisk": 40,
    "roleOfInstance":
      "arn:aws:iam::............:instance-profile/weatherballoon-ec2-accesses-s3all"
  },
  "tag": "Remoter",
  "minutesMaxRun": 120,
  "spooler": "tmux",
  "sync": {
    "adirLocal": null,
    "fileExcludes": null,
    "adirServer": "/home/ubuntu/srchome",
    "dirStorage": "weatherballoon-test1"
  }
}

Specify your credentials with a file named .weatherballoon_cred.json

Weatherballoon searches for the .weatherballoon_cred.json file starting in the current working directory, and then in each of its parent directories in turn.

An example .weatherballoon_cred.json file:

{
  "id": ". . .",
  "secret": ". . ."
}

Configuration: ssh

Weatherballoon embeds its own ssh client. The client is configured to read openssh-style private key files. Place each private key file in the ~/.ssh directory, e.g. ~/.ssh/name_of_key.

For AWS, upload the public part of the key in the AWS console under the name "name_of_key", the same name as the file.

Authorization

Authorization: AWS

Authorization: AWS: User, quick and dirty

Two credentials are required: a user for running on your local machine, and a role for EC2.

Create an IAM user for "Programmatic access", to run on your local machine. For more information about creating users for programmatic access, consult the IAM user guide:

https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users_create.html

Deposit the user's access keys into a new file ~/.weatherballoon_cred.json:

{
  "id": "...",
  "secret": "..."
}

Set the permissions of ~/.weatherballoon_cred.json so that only the current user can read it. (Hint: compare with the permissions of ~/.ssh/id_rsa)
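
On a POSIX system, the permission step can be done and checked as in the Python sketch below. The helper names `restrict_to_owner` and `is_owner_only` are hypothetical. On Windows, `os.chmod` only controls the read-only flag, so this sketch assumes macOS or Linux.

```python
import os
import stat


def restrict_to_owner(path):
    """Make `path` readable and writable by its owner only (mode 0600)."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)


def is_owner_only(path):
    """Return True if neither group nor others have any access to `path`."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0
```

For example, `restrict_to_owner(os.path.expanduser("~/.weatherballoon_cred.json"))` has the same effect as `chmod 600` on that file.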

Grant the user the following policies:

AmazonEC2FullAccess
AmazonS3FullAccess
IAMFullAccess

Authorization: AWS: Role, quick and dirty

Now we need to create the role that EC2 will use to run weatherballoon.

  • Go to the AWS Web Console
  • Under Services, select "IAM"
  • Under Roles, select "Create role"
  • Under "Select the type of trusted entity", select "AWS Service"
  • Under "Choose the service that will use this role", select "EC2"
  • Select "Next: Permissions"
  • In the search box, type "s3full", check "AmazonS3FullAccess", and select "Next: Tags"
  • Select "Next: Review"
  • Under "Role name", enter "weatherballoon-ec2-accesses-s3all"
  • Select "Create role"

Now AWS has the role for running weatherballoon and we need to tell weatherballoon about it.

  • If you have not already, copy the example doc/sample_.weatherballoon.json to .weatherballoon.json in the directory where you plan to work.
  • Open .weatherballoon.json for editing
  • Back in the AWS Web Console, under IAM, under Roles, the table will have a new entry.
  • Select the hyperlink for "weatherballoon-ec2-accesses-s3all"
  • Select and copy the instance profile ARN, which has the form arn:aws:iam::............:instance-profile/weatherballoon-ec2-accesses-s3all
  • Make sure you did not copy the role ARN, which has the form arn:aws:iam::............:role/weatherballoon-ec2-accesses-s3all
  • Paste the instance profile ARN into your .weatherballoon.json, under the field "roleOfInstance"
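
Because the two ARNs differ only in their resource prefix, it is easy to paste the wrong one. The Python sketch below shows one way to check the value before saving it in the "roleOfInstance" field of the sample configuration. The function name `is_instance_profile_arn` is hypothetical, and the account number in the usage note is a placeholder.

```python
def is_instance_profile_arn(arn: str) -> bool:
    """Return True if `arn` names an IAM instance profile rather than a role."""
    # ARN layout: arn:partition:service:region:account-id:resource
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "iam":
        return False
    return parts[5].startswith("instance-profile/")
```

With the placeholder account number 123456789012, `is_instance_profile_arn("arn:aws:iam::123456789012:instance-profile/weatherballoon-ec2-accesses-s3all")` is true, while the same string with `role/` in place of `instance-profile/` is rejected.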

Authorization: AWS: least privilege

The permissions granted above can be narrowed as follows:

The user and role need only write access to the S3 buckets specified in .weatherballoon.json

The user does not really need full IAM access, but only:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::............:role/*"
        }
    ]
}

Remember to insert your own account number when granting this policy.

Note that the ARN in the policy contains "role/", not "instance-profile/".

Note that the "*" can be further restricted to passing of the specific role you have created.
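
As a sketch of that last restriction, the Python snippet below builds the same policy with the "*" replaced by a single role name. The function name `pass_role_policy` is hypothetical, and 123456789012 is a placeholder account number that must be replaced with your own.

```python
import json


def pass_role_policy(account_id: str, role_name: str) -> dict:
    """Build an IAM policy that allows passing exactly one named role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "iam:PassRole",
                # Note "role/", not "instance-profile/", in the resource ARN.
                "Resource": "arn:aws:iam::%s:role/%s" % (account_id, role_name),
            }
        ],
    }


# Placeholder account number; the role name is the one created earlier.
policy = pass_role_policy("123456789012", "weatherballoon-ec2-accesses-s3all")
print(json.dumps(policy, indent=4))
```

The printed JSON can be pasted into the IAM policy editor in place of the wildcard version.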

Known Issues

  • The number of minutes before termination is hardcoded as numMinutes=15 in src/main/resources/file/install_heartbeat_cron.sh
  • The configuration key sync.fileExcludes was broken during the port from rsync to rclone
  • Credentials cannot be encrypted
  • There is only one cloud provisioner (AWS)

Development

To recompile weatherballoon, use the following gradle command:

./gradlew zip

The distribution zip will be built at build/weatherballoon.zip

Revision History

Release 0.1:

  • AWS EC2 backend implementation
  • Server shell scripts compatible with Ubuntu
  • Basic configuration
  • JCE extensions workaround: Distribute binaries as two jar files instead of one
  • Create README.md file

Release 0.1.1:

Features:

  • Configurable capacity of main disk
  • Show rclone progress
  • Directories can be optionally excluded from rclone
  • Optional spooling of jobs via tmux
  • Display stderr of commands
  • Clarify requirements for retry behavior
  • Bring retry behavior closer to stated requirements

Bug fixes:

  • Fixed: global job timeout
  • Fixed: don't fail if status file /var/log/userdata-done gets baked into AMI

Release Roadmap

Future release 0.9: Error handling, configuration, and documentation

  • Abstraction layer for pluggable cloud backends
  • Helpful messaging for authentication, authorization, and configuration errors
  • Validate prerequisite rclone installation
  • Semantic versioning for configuration file
  • Distribution as one jar file not two
  • Configure installed shell scripts from the unified configuration file to admit target OSes other than Ubuntu
  • Documentation: Reference
  • Documentation: Installation guide

Future release 1.0: A second cloud provisioner

  • Automation for authorization according to principle of least privilege
  • Implement a second cloud compute backend

Future releases past 1.0

  • More cloud vendor backends
  • Extension point to enable low latency log aggregation and streaming (e.g. CloudWatch Logs)
  • Extension point for extra supervision features based on serverless function services (e.g. AWS Lambda)
  • Storage of credentials in workstation keychain
  • Configuration of remote OS images that is invariant of cloud provisioner
