This repository contains the Docker Compose files and bootstrap scripts used to create a deployment of the UCSC Genomics Institute's Computational Genomics Platform (CGP) for the NIH Data Commons Pilot on AWS. It uses, supports, and drives development of several key GA4GH APIs and open source projects. In many ways it is a generalization of the cloud infrastructure developed for the PCAWG project and a potential reference implementation for the NIH Commons concept.
The system has components fulfilling a range of functions, all of which are open source and can be used independently or together. The installation instructions are currently specific to AWS, but usage of other cloud service providers is planned for the future.
These components are set up by the install process available in this repository:
- Boardwalk: our file browsing portal on top of the Azul Indexer.
Related projects that are either already set up and available for use on the web or are used by the components above:
- Dockstore: our workflow and tool sharing platform
Use the AWS console or command line tool to create a host virtual machine. While you do this, make a note of your security group name and ID and ensure you can connect via ssh. We will refer to this virtual machine as the VM throughout the rest of the documentation. Ultimately, the performance and size of the VM depend on the traffic you expect. (Note: We have had problems uploading big files (~25GB) to the Virginia region. If possible, set up your AWS resources in any region other than Virginia.)
The following specification has worked well for a small-scale production environment:
- Ubuntu Server 16.04
- r4.xlarge
- 250GB disk
For small-scale development, a t2.medium with a 20GB disk running Ubuntu Server 16.04 may be suitable. For more intensive development, an m5.large instance with a 60GB disk has been sufficient.
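For readers who prefer the command line tool, the specification above could be expressed roughly as follows. This is a minimal sketch, not part of this repository: the function name `launch_vm_cmd`, the AMI ID, the key name, the security group ID, and the root device name `/dev/sda1` are all assumptions that you must replace with your own values. The function only prints the command so you can review it before running it.

```shell
# Sketch only: prints (does not run) an AWS CLI command matching the spec above.
# All arguments are placeholders supplied by you; /dev/sda1 is an assumed
# root device name, so check the AMI you actually use.
launch_vm_cmd() {
  local ami_id="$1" key_name="$2" sg_id="$3"
  local instance_type="${4:-r4.xlarge}" disk_gb="${5:-250}"
  printf "aws ec2 run-instances --image-id %s --instance-type %s --key-name %s --security-group-ids %s --block-device-mappings 'DeviceName=/dev/sda1,Ebs={VolumeSize=%s}'\n" \
    "$ami_id" "$instance_type" "$key_name" "$sg_id" "$disk_gb"
}

# Example: small-scale production defaults (hypothetical IDs).
launch_vm_cmd ami-EXAMPLE my-key-pair sg-EXAMPLE
# Example: small-scale development.
launch_vm_cmd ami-EXAMPLE my-key-pair sg-EXAMPLE t2.medium 20
```

Printing instead of executing keeps the sketch safe to paste into a terminal; copy the printed line once the placeholders are real.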
- Go to AWS console, Compute, EC2 Dashboard. There click running instances.
- Find your VM and make it active by clicking anywhere in that line.
- Under the drop-down Actions, go to Networking, Manage IP Addresses.
- Click on Allocate an Elastic IP. It will automatically create an IP address and show it on the screen.
- Write down that Elastic IP. At this point, the address shown under IPv4 Public IP for your VM is still the instance's original address, not the Elastic IP. Next you need to associate the Elastic IP with your VM.
- Back in EC2 Dashboard go to Elastic IPs. The IP address just created should be in that list. Check it and under Actions, Associate, choose resource type "Instance", and choose your EC2 (e.g., searching by its name).
- In EC2 Dashboard make your VM active by clicking it. Then click Connect on top. The example in that window shows you how to ssh into your VM from a terminal.
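The console steps above have a command-line equivalent. The following is a hedged sketch that prints the two AWS CLI calls involved; the function name `elastic_ip_cmds` and both IDs are hypothetical placeholders, and the allocation ID is the value the first call reports in its output.

```shell
# Sketch only: prints the two AWS CLI calls that mirror the console steps.
# instance_id and allocation_id are placeholders; the allocation ID is
# returned in the output of the allocate-address call.
elastic_ip_cmds() {
  local instance_id="$1" allocation_id="$2"
  echo "aws ec2 allocate-address --domain vpc"
  echo "aws ec2 associate-address --instance-id ${instance_id} --allocation-id ${allocation_id}"
}

elastic_ip_cmds i-EXAMPLE eipalloc-EXAMPLE
```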
You will need to create a DNS record that maps to your VM. On the AWS console, go to the Route 53 service and click Hosted zones. If you already have a hosted zone you can use, select it from the list. If not, create one to suit your needs. Once the zone is selected, click Create Record Set at the top. Choose a name you like and use the Elastic IP as the value.
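If you would rather create the record from the command line, a sketch along these lines may help. It generates the change batch for an A record pointing at the Elastic IP. The function name `route53_change_batch`, the record name, the IP address, and the TTL of 300 seconds are illustrative assumptions, not values taken from this repository.

```shell
# Sketch only: emits a Route 53 change batch that maps a name to an Elastic IP.
# record_name, elastic_ip and the default TTL are assumptions for illustration.
route53_change_batch() {
  local record_name="$1" elastic_ip="$2" ttl="${3:-300}"
  cat <<EOF
{
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "${record_name}",
        "Type": "A",
        "TTL": ${ttl},
        "ResourceRecords": [{ "Value": "${elastic_ip}" }]
      }
    }
  ]
}
EOF
}

# Usage (hypothetical zone ID), after reviewing the generated file:
#   route53_change_batch commons.example.org 203.0.113.10 > change.json
#   aws route53 change-resource-record-sets --hosted-zone-id ZEXAMPLE \
#     --change-batch file://change.json
```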
Open inbound ports on your security group. Use the table below as a guide. Make sure you add /32 to the Elastic IP.
Type | Port | Source | Description |
---|---|---|---|
HTTP | 80 | 0.0.0.0/0 | |
HTTP | 80 | ::/0 | |
HTTPS | 443 | 0.0.0.0/0 | |
HTTPS | 443 | ::/0 | |
All TCP | 0 - 65535 | Your VM's Elastic IP | |
All TCP | 0 - 65535 | Your Security Group ID | |
Custom TCP Rule | 9000 | Your VM's Elastic IP | webservice port |
Custom TCP Rule | 9200 | Your VM's Elastic IP | Elasticsearch |
SSH | 22 | 0.0.0.0/0 | |
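The IPv4 rows of the table can also be added with the AWS CLI. The sketch below shows one way to append the /32 suffix to the Elastic IP and to print the matching command for each rule; `to_host_cidr` and `ingress_cmd` are hypothetical helper names, and the group ID and IP address are placeholders. IPv6 sources and rules whose source is a security group need different options and are not covered here.

```shell
# Sketch only: helpers for the IPv4 CIDR rows of the table above.
# Appends /32 to a bare IP; leaves values that already carry a prefix alone.
to_host_cidr() {
  case "$1" in
    */*) echo "$1" ;;
    *)   echo "$1/32" ;;
  esac
}

# Prints (does not run) the AWS CLI call for one inbound TCP rule.
ingress_cmd() {
  local group_id="$1" port="$2" cidr="$3"
  echo "aws ec2 authorize-security-group-ingress --group-id ${group_id} --protocol tcp --port ${port} --cidr ${cidr}"
}

# Example with placeholder values: HTTP from anywhere, Elasticsearch from the VM only.
ingress_cmd sg-EXAMPLE 80 0.0.0.0/0
ingress_cmd sg-EXAMPLE 9200 "$(to_host_cidr 203.0.113.10)"
```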
On your local machine, add the key pair file under `~/.ssh/<your_key_pair>.pem`. This is typically the same key pair that you use to connect to your VM via SSH. The key pair needs to be created on the AWS console so Amazon is aware of it. Set the permissions of the key pair file to read-by-user-only with `chmod 400 ~/.ssh/<your_key_pair>.pem` so it is not publicly viewable.
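As a quick check, the following sketch restricts a key file and prints the resulting permission string. The function name `secure_key` is hypothetical, and the login name `ubuntu` in the usage comment is an assumption based on the Ubuntu Server images mentioned earlier.

```shell
# Sketch only: restrict a key file to read-by-user-only and show the result.
secure_key() {
  local key_file="$1"
  chmod 400 "$key_file"
  # First 10 characters of `ls -l` are the file type and permission bits.
  ls -l "$key_file" | cut -c1-10
}

# Usage (placeholders left unfilled on purpose):
#   secure_key ~/.ssh/<your_key_pair>.pem      # expect -r--------
#   ssh -i ~/.ssh/<your_key_pair>.pem ubuntu@<elastic-ip>
```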
The NIH Data Commons (DCPPC) uses BDBags to move metadata from one platform to another. In Boardwalk, a BDBag is created by clicking Export to FireCloud. Once clicked, the selected metadata are packaged in a BDBag, and the bag is uploaded to an S3 bucket. Therefore, part of the installation process is creating an S3 bucket. Follow these steps to create it:
- In the AWS console head over to S3.
- Click Create bucket. Name your bucket (the name must be unique), set the region, and click Next. Take note of the bucket name; you'll be asked for it later during the installation.
- In the next two tabs, Configure options and Set permissions, leave the default settings and click Next.
- Review the settings, and click Create bucket.
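The same bucket can be created with the AWS CLI. The sketch below prints the command for a given bucket name and region; `create_bucket_cmd` is a hypothetical helper and both arguments are placeholders. It accounts for one known quirk: in `us-east-1` the `LocationConstraint` must be omitted, while other regions require it.

```shell
# Sketch only: prints (does not run) the bucket creation command.
# us-east-1 rejects an explicit LocationConstraint; other regions need one.
create_bucket_cmd() {
  local bucket="$1" region="$2"
  if [ "$region" = "us-east-1" ]; then
    echo "aws s3api create-bucket --bucket ${bucket} --region ${region}"
  else
    echo "aws s3api create-bucket --bucket ${bucket} --region ${region} --create-bucket-configuration LocationConstraint=${region}"
  fi
}

# Example with a placeholder bucket name.
create_bucket_cmd my-bdbag-bucket us-west-2
```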
Next we want to limit the lifetime of objects in that bucket to 1 day (technically, each object only needs to exist for a few minutes). To do that, in Amazon S3:
- Search for the bucket you just created and click on it.
- Go to the Management tab, then to the Lifecycle tab, and click + Add lifecycle rule.
- Name your rule (e.g., "limit to 1 day") and click Next.
- Accept the default settings in the Transitions tab, and click Next.
- In the Expiration tab, click Current version. That checks Expire current version of object. In the prompt, enter "1" as the number of days after object creation. Click Next.
- Review the settings. The scope should include the whole bucket. Click Save.
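The lifecycle rule described above can also be written as a JSON document and applied with the AWS CLI. This is a sketch under assumptions: the helper name `lifecycle_json` and the rule ID are illustrative, and the empty prefix filter is one way to make the rule apply to the whole bucket.

```shell
# Sketch only: emits a lifecycle configuration that expires current object
# versions after the given number of days (default 1) across the whole bucket.
lifecycle_json() {
  local days="${1:-1}"
  cat <<EOF
{
  "Rules": [
    {
      "ID": "limit to ${days} day",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Expiration": { "Days": ${days} }
    }
  ]
}
EOF
}

# Usage (bucket name is a placeholder):
#   lifecycle_json 1 > lifecycle.json
#   aws s3api put-bucket-lifecycle-configuration --bucket my-bdbag-bucket \
#     --lifecycle-configuration file://lifecycle.json
```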
Now you can begin installation. You will clone this repository on the VM and run the bootstrap script. Be sure to set your branch to `feature/commons`, as these instructions are specific to this branch.
- SSH onto the VM you created.
- Clone this repo: `git clone https://github.com/DataBiosphere/cgp-deployment.git`
- Change to the repo's directory: `cd cgp-deployment`
- Check out the correct branch: `git checkout feature/commons`
- Run the install script: `sudo bash install_bootstrap`
The install script will ask for lots of information. Here we'll explain what to put for each step.
- First, the installer asks to install Docker and other dependencies. These are necessary to continue.
- Next, the installer asks whether to launch the public-facing gateway nginx server. You can do this in either `dev` mode or `prod` mode. For more details, see the section on installing in dev vs. prod mode.
- Now you will have to decide whether to launch Boardwalk in `dev` or `prod` mode.
- Next, you are asked whether you want an authorization whitelist. If you decide to use a whitelist, you will need to set it up with bouncer. You will also be prompted for a project name and a contact email, which are used in error messages shown to users who attempt to access the data without authorization.
- You will now have to enter the hostname of dcc-dashboard. This is the domain you made in Route 53, above. You need to know the name of the domain, but at the time of installation the domain (or record set) does not actually have to be configured in Route 53.
- Next, you are prompted for the AWS region to be used and for AWS credentials. The region is the one your whitelist, buckets, etc. are in. The credentials can be made specifically for the instance, or for a dev instance you could use your own.
- The Elasticsearch instance domain can be a pre-existing one if it is used only for development.
- Provide the Google Client ID and Client Secret you got from the OAuth2 app. Instructions on how to set this up can be found on the Boardwalk deployment page. Also, enter your Google site verification code if you have one; otherwise enter `NONE`.
- The S3 bucket you created earlier is what you should use for this next step. Provide the name you assigned to the bucket and the AWS region that the bucket resides in.
- The dos-dss server is needed next. We use the cgp-data-store.
Installation can happen in `dev` mode for development or in `prod` mode for production. This decision can be made for Common and Boardwalk independently. For details regarding Boardwalk, see the README.
Once the above steps have been completed, we are ready to install the components of the CGP. In `prod` mode, the installation will run the Docker containers for all of the components listed below from the respective images on Quay.io. The `nginx` image will be built from the nginx-image directory.
Setting up Common (the gateway nginx server) to run in `dev` mode will cause Let's Encrypt to issue fake SSL certificates, which don't count against your certificate limit. Setting up Boardwalk to run in `dev` mode will first build and then run the Docker containers `boardwalk_nginx`, `boardwalk_dcc-dashboard`, `boardwalk_dcc-dashboard-service`, and `boardwalk_boardwalk` from the images (see here for more details). In addition, the `nginx` image is built from the nginx-dev directory. If your work requires real SSL certificates during development, it is recommended to set up Common in `prod` mode and Boardwalk in `dev` mode.
Once the installer completes, the system should be up and running. Congratulations! Execute `sudo docker ps` to get an idea of which containers are running.
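To go one step beyond eyeballing the output, a small helper can compare the running container names against a list you expect. This is a sketch, not part of the repository: `missing_containers` is a hypothetical name, the expected names are the Boardwalk `dev` mode containers mentioned earlier, and the optional numeric suffix (such as `_1`) is an assumption about how docker-compose may name containers.

```shell
# Sketch only: reads running container names on stdin (one per line) and
# prints every expected name (given as arguments) that is not running.
# An optional numeric suffix such as "_1" is tolerated (assumption).
missing_containers() {
  local running expected
  running="$(cat)"
  for expected in "$@"; do
    printf '%s\n' "$running" | grep -Eq -- "^${expected}(_[0-9]+)?\$" || echo "$expected"
  done
}

# Usage:
#   sudo docker ps --format '{{.Names}}' | missing_containers \
#     boardwalk_nginx boardwalk_dcc-dashboard \
#     boardwalk_dcc-dashboard-service boardwalk_boardwalk
```

Empty output means every expected container was found.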
To test that the required number of Docker containers is successfully running, execute `cd test && ./integration.sh`. This sends several HTTP GET requests to the configured DCC dashboard host and checks whether the responses are as expected.
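For a quick manual probe of a single URL, something like the sketch below may be useful. It is not what `integration.sh` does; the helper names `status_ok` and `probe` are hypothetical, and treating any 2xx or 3xx status as healthy is an assumption you may want to tighten.

```shell
# Sketch only: classify an HTTP status code; 2xx and 3xx count as healthy
# (an assumption, not the rule used by integration.sh).
status_ok() {
  case "$1" in
    2??|3??) return 0 ;;
    *)       return 1 ;;
  esac
}

# Fetch only the status code of a URL and report it.
probe() {
  local url="$1" code
  code="$(curl -s -o /dev/null -w '%{http_code}' "$url")"
  if status_ok "$code"; then
    echo "OK $code $url"
  else
    echo "FAIL $code $url"
  fi
}

# Usage (hostname is a placeholder):
#   probe https://commons.example.org/
```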
- If Boardwalk works fine over `http` but not over `https`, you can try restarting the Docker containers: run `cd common`, then `sudo docker-compose -f base.yml -f prod.yml down`, then `sudo docker-compose -f base.yml -f prod.yml up -d`.
- If you switch the gateway nginx server (Common) from `dev` to `prod`, you will also need to release the Elastic IP and create a new one. Also kill the Docker containers, delete the images, and try running the install script again.
- If all else fails, you can open an issue or contact a human.
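For the "kill the containers, delete the images" step, the commands could look roughly like the ones printed by this sketch. The function name `cleanup_cmds` is hypothetical. Be aware that these commands act on every container and image on the host, not only those belonging to the CGP, so review them before running anything.

```shell
# Sketch only: prints (does not run) commands that remove ALL containers and
# ALL images on the host. Review carefully before copying them into a shell.
cleanup_cmds() {
  echo 'sudo docker rm -f $(sudo docker ps -aq)'
  echo 'sudo docker rmi -f $(sudo docker images -q)'
}

cleanup_cmds
```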
- This blog post is helpful if you want to clean up previous images/containers/volumes.
- The bootstrapper should install Java and the Dockstore CLI.