Merge pull request #129 from ENCODE-DCC/dev
v1.6.3
leepc12 authored Jun 3, 2021
2 parents 95c4fd0 + 894100e commit 0ba420b
Showing 8 changed files with 181 additions and 88 deletions.
19 changes: 6 additions & 13 deletions README.md
@@ -73,7 +73,12 @@ Caper is based on Unix and cloud platform CLIs (`curl`, `gsutil` and `aws`) and
gcp-prj | Google Cloud Platform Project
gcp-out-dir | Output bucket path for Google Cloud Platform. This should start with `gs://`.

7) To use Caper on Google Cloud Platform (GCP), [configure for GCP](docs/conf_gcp.md). To use Caper on Amazon Web Service (AWS), [configure for AWS](docs/conf_aws.md).
7) To use Caper on Google Cloud Platform (GCP), we provide a shell script to create a Caper server instance on Google Cloud.
See [this](scripts/gcp_caper_server/README.md) for details.

8) To use Caper on Amazon Web Service (AWS), we provide a shell script to create a Caper server instance on AWS.
See [this](scripts/aws_caper_server/README.md) for details.
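
If you don't already have these scripts locally, one hypothetical way to get them (the repository URL is an assumption; the script paths follow the links above) is to clone the repository. Each script prints detailed help when run without arguments:
```bash
$ git clone https://github.com/ENCODE-DCC/caper
$ bash caper/scripts/gcp_caper_server/create_instance.sh   # help for the GCP server script (step 7)
$ bash caper/scripts/aws_caper_server/create_instance.sh   # help for the AWS server script (step 8)
```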


## Output directory

@@ -546,18 +551,6 @@ This file DB is generated in your working directory by default. Its default file
Unless you explicitly define `file-db` in your configuration file `~/.caper/default.conf`, this file DB's name will depend on your input JSON's filename. Therefore, you can simply resume a failed workflow with the same command line used to start a new pipeline.
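
A minimal sketch of this resume behavior (the WDL and JSON file names are placeholders):
```bash
$ caper run my_pipeline.wdl -i input.json
# If the workflow fails, rerun the exact same command line. Since the file DB
# name is derived from the input JSON filename, Cromwell call-caches finished
# tasks and the workflow resumes where it left off.
$ caper run my_pipeline.wdl -i input.json
```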


## Caper server instance on Google Cloud

We provide a shell script to create a Caper server instance on Google Cloud.
See [this](scripts/gcp_caper_server/README.md) for details.


## Caper server instance on AWS

We provide a shell script to create a Caper server instance on AWS.
See [this](scripts/aws_caper_server/README.md) for details.


# DETAILS

See [details](DETAILS.md).
2 changes: 1 addition & 1 deletion caper/__init__.py
@@ -2,4 +2,4 @@
from .caper_runner import CaperRunner

__all__ = ['CaperClient', 'CaperClientSubmit', 'CaperRunner']
__version__ = '1.6.2'
__version__ = '1.6.3'
42 changes: 1 addition & 41 deletions docs/conf_aws.md
@@ -1,41 +1 @@
## Configuration for S3 storage access

1. Sign up for an [AWS account](https://aws.amazon.com/account/).

2. Make sure that your account has permission on two services (S3 and EC2).
- Admin: full permission on both EC2 and output S3 bucket.
- User: read/write permission on the output S3 bucket.

3. Configure your AWS CLI. Enter key and password obtained from your account's IAM.
```bash
$ aws configure
```

## Configuration for AWS backend

Please follow the above instruction for S3 storage access.

1. Click on [this](
https://console.aws.amazon.com/cloudformation/home?#/stacks/new?stackName=GenomicsVPC&templateURL=https://aws-quickstart.s3.amazonaws.com/quickstart-aws-vpc/templates/aws-vpc.template.yaml) to create a new AWS VPC. Make sure that the region in the top right corner of the console page matches your region of interest. Click on `Next` and then `Next` again. Agree to `Capabilities`. Click on `Create stack`.

2. Choose all available zones in `Availability Zones`. For example, if your region is `us-east-2`, then you will see `us-east-2a`, `us-east-2b` and `us-east-2c`. Choose all.

3. Click on [this](
https://console.aws.amazon.com/cloudformation/home?#/stacks/new?stackName=gwfcore&templateURL=https://aws-genomics-workflows.s3.amazonaws.com/v3.0.6.1/templates/gwfcore/gwfcore-root.template.yaml) to create a new AWS Batch. Make sure that the region in the top right corner of the console page matches your region of interest. Click on `Next`.

4. There are several required parameters to be specified on this page:
- `S3 Bucket name`: S3 bucket name to store your pipeline outputs. This is not a full path for the output directory; it's just the bucket's name.
- `Existing Bucket?`: `True` if the above bucket already exists.
- `VPC ID`: Choose the VPC `GenomicsVPC` that you just created.
- `VPC Subnet IDs`: Choose two private subnets created with the above VPC.
- (**IMPORTANT**) `Template Root URL`: `https://caper-aws-genomics-workflows.s3-us-west-1.amazonaws.com/src/templates`.

5. Click on `Next` and then `Next` again. Agree to `Capabilities`. Click on `Create stack`.

6. Go to your [AWS Batch](https://console.aws.amazon.com/batch) and click on `Job queues` in the left sidebar. Click on `default-*`. Get the ARN for your job queue under the key `Queue ARN`. This ARN will be used later to create the Caper server instance.



## References

https://docs.opendata.aws/genomics-workflows/orchestration/cromwell/cromwell-overview/
Deprecated. Please see [this](../scripts/aws_caper_server/README.md) instead.
4 changes: 4 additions & 0 deletions docs/conf_gcp.md
@@ -1,3 +1,7 @@
Deprecated. Please see [this](../scripts/gcp_caper_server/README.md) instead.

# DEPRECATED

# Configuration for Google Cloud Platform backend (`gcp`)

1. Sign up for a Google account.
93 changes: 84 additions & 9 deletions scripts/aws_caper_server/README.md
@@ -3,14 +3,35 @@
`create_instance.sh` will create a new Caper server instance in your AWS EC2 region and configure the instance for Cromwell with a PostgreSQL database.


## Requirements
## AWS account

Follow these two instructions before running the shell script.
- [Configuration for S3 storage access](../../docs/conf_aws.md#Configuration-for-S3-storage-access)
- [Configuration for AWS backend](../../docs/conf_aws.md#Configuration-for-AWS-backend)
1. Sign up for an [AWS account](https://aws.amazon.com/account/).
2. Make sure that your account has full permission on two services (S3 and EC2).
3. Configure your AWS CLI. Enter the key, secret (password) and region (**IMPORTANT**) obtained from your account's IAM, then verify the configuration as shown below.
```bash
$ aws configure
```
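
After `aws configure`, you can optionally sanity-check the credentials and the default region (both are standard AWS CLI commands):
```bash
$ aws sts get-caller-identity   # confirms that your credentials are valid
$ aws configure get region      # prints the default region you just set
```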

## VPC

1. Click on [this](
https://console.aws.amazon.com/cloudformation/home?#/stacks/new?stackName=GenomicsVPC&templateURL=https://aws-quickstart.s3.amazonaws.com/quickstart-aws-vpc/templates/aws-vpc.template.yaml) to create a new AWS VPC. Make sure that the region in the top right corner of the console page matches your region of interest. Click on `Next` and then `Next` again. Agree to `Capabilities`. Click on `Create stack`.
2. Choose all available zones in `Availability Zones`. For example, if your region is `us-west-2`, then you will see `us-west-2a`, `us-west-2b` and `us-west-2c`.


## AWS Batch

1. Click on [this](
https://console.aws.amazon.com/cloudformation/home?#/stacks/new?stackName=gwfcore&templateURL=https://caper-aws-genomics-workflows.s3-us-west-2.amazonaws.com/templates/gwfcore/gwfcore-root.template.yaml) to create a new AWS Batch. Make sure that the region in the top right corner of the console page matches your region of interest. Click on `Next`.
2. There are several required parameters to be specified on this page:
- `S3 Bucket name`: S3 bucket name to store your pipeline outputs. This is not a full path for the output directory; it's just the bucket's name without the scheme prefix `s3://`. Make sure that this bucket doesn't exist yet. If it exists, delete it or try a different, non-existing bucket name.
- `VPC ID`: Choose the VPC `GenomicsVPC` that you just created.
- `VPC Subnet IDs`: Choose all private subnets created with the above VPC.
3. Click on `Next` and then `Next` again. Agree to `Capabilities`. Click on `Create stack`.
4. Go to your [AWS Batch](https://console.aws.amazon.com/batch) and click on `Job queues` in the left sidebar. Click on `default-*`. Get the ARN for your job queue under the key `Queue ARN`. This ARN will be used later to create the Caper server instance, as shown below.
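
If you prefer the command line, a hypothetical way to look up this ARN (the region and the `default-` queue-name prefix are assumptions; adjust them to your setup) is:
```bash
$ aws batch describe-job-queues --region us-west-2 \
    --query "jobQueues[?starts_with(jobQueueName, 'default-')].jobQueueArn" \
    --output text
```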

## How to create an instance (admin)

## How to create a server instance

Run without parameters to see detailed help.
```bash
@@ -22,7 +43,7 @@ Try with the positional arguments only first and see if it works.
$ bash create_instance.sh [INSTANCE_NAME] [AWS_REGION] [PUBLIC_SUBNET_ID] [AWS_BATCH_ARN] [KEY_PAIR_NAME] [AWS_OUT_DIR]
```
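
For illustration only, a hypothetical invocation with placeholder values (substitute your own instance name, region, subnet ID, Batch ARN, key pair and output bucket; the per-argument notes follow below):
```bash
$ bash create_instance.sh my-caper-server us-west-2 subnet-0123456789abcdef0 \
    "arn:aws:batch:us-west-2:123456789012:job-queue/default-gwfcore" \
    my-key-pair s3://my-caper-out-bucket/out
```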

- `AWS_REGION`: Your AWS region. e.g. `us-east-1`. Make sure that it matches with `region` in your AWS credentials file `$HOME/.aws/credentials`.
- `AWS_REGION`: Your AWS region. e.g. `us-west-2`. Make sure that it matches with `region` in your AWS credentials file `$HOME/.aws/credentials`.
- `PUBLIC_SUBNET_ID`: Click on `Services` on the AWS Console and choose `VPC`. Click on `Subnets` in the left sidebar and find `Public subnet 1` under the VPC created in the above instruction.
- `AWS_BATCH_ARN`: ARN of the AWS Batch created in the above instruction. Double-quote the whole ARN since it includes `:`.
- `KEY_PAIR_NAME`: Click on `Services` on the AWS Console and choose `EC2`. Choose `Key Pairs` in the left sidebar and create a new key pair (in `.pem` format). Take note of the key name and keep the `.pem` key file in a secure directory from which you will SSH to the instance. You will need it later when you SSH to the instance.
@@ -56,7 +77,7 @@ $ cd /opt/caper
$ screen -dmS caper_server bash -c "caper server > caper_server.log 2>&1"
```

## How to stop Caper server (admin)
## How to stop Caper server

On the instance, attach to the existing screen `caper_server` and stop it with Ctrl + C.
```bash
@@ -65,7 +86,7 @@ $ screen -r caper_server # attach to the screen
# in the screen, press Ctrl + C to send SIGINT to Caper
```

## How to start Caper server (admin)
## How to start Caper server

On the instance, make a new screen `caper_server`.
```bash
@@ -74,7 +95,7 @@ $ cd /opt/caper
$ screen -dmS caper_server bash -c "caper server > caper_server.log 2>&1"
```

## How to submit workflow (user)
## How to submit a workflow

For the first log-in, authenticate yourself to get permission to read/write on the output S3 bucket. This allows Caper to localize any external URIs (defined in an input JSON) under the output S3 bucket's directory with the suffix `.caper_tmp/`. Make sure that you have full permission on the output S3 bucket.
```bash
@@ -93,3 +114,57 @@ $ caper submit [WDL] -i input.json ...
```

Caper will localize big data files (e.g. FASTQs and reference genome data defined in an input JSON) in an S3 bucket directory `--aws-loc-dir` (or `aws-loc-dir` in the Caper conf file), which defaults to `[AWS_OUT_DIR]/.caper_tmp/` if not defined.
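
A hypothetical submission that sets this localization directory explicitly (the bucket path and WDL/JSON names are placeholders):
```bash
$ caper submit my_pipeline.wdl -i input.json \
    --aws-loc-dir s3://my-caper-out-bucket/out/.caper_tmp
```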


## Using S3 URIs in input JSON

**VERY IMPORTANT!**

Caper localizes input files under the output S3 bucket path + `.caper_tmp/` if they are given as non-S3 URIs (e.g. `gs://example/ok.txt`, `http://hello.com/a.txt`, `/any/absolute/path.txt`). However, if S3 URIs are given in an input JSON, Caper will not localize them and will pass them directly to Cromwell, which is very picky about **region** and **permission**.

First of all, **PLEASE DO NOT USE ANY EXTERNAL S3 FILES OUTSIDE OF YOUR REGION**. Call-caching will not work for those external files. For example, suppose your Caper server resides in `us-west-2` and you want to use the Broad reference file `s3://broad-references/hg38/v0/Homo_sapiens_assembly38.dict`. All Broad data are in `us-east-1`, so call-caching will never work.

Another example is a file on the ENCODE portal. [This FASTQ file](https://www.encodeproject.org/files/ENCFF641SFZ/) has a public S3 URI in its metadata: `s3://encode-public/2017/01/27/92e9bb3b-bc49-43f4-81d9-f51fbc5bb8d5/ENCFF641SFZ.fastq.gz`. All ENCODE portal data are in `us-west-2`, so call-caching will not work in other regions. It's recommended to directly use the file's URL `https://www.encodeproject.org/files/ENCFF641SFZ/@@download/ENCFF641SFZ.fastq.gz` in the input JSON, as in the sketch below.
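
A minimal sketch (the JSON key `my_pipeline.fastq` is a placeholder; use whatever input name your WDL expects):
```bash
# Hypothetical input JSON entry; "my_pipeline.fastq" is a placeholder key.
$ cat > input.json << 'EOF'
{
  "my_pipeline.fastq": "https://www.encodeproject.org/files/ENCFF641SFZ/@@download/ENCFF641SFZ.fastq.gz"
}
EOF
```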

**DO NOT USE S3 FILES IN A PRIVATE BUCKET**. Job instances will not have access to those private files even though the server instance does (with your credentials configured via `aws configure`). For example, the ENCODE portal's unreleased files are in a private bucket `s3://encode-private`. Jobs will always fail if you use these private files.

If S3 files in an input JSON are public and in the same region, check whether you have the `s3:GetObjectAcl` permission on the file.
```bash
$ aws s3api get-object-acl --bucket encode-public --key 2017/01/27/92e9bb3b-bc49-43f4-81d9-f51fbc5bb8d5/ENCFF641SFZ.fastq.gz
{
    "Owner": {
        "DisplayName": "encode-data",
        "ID": "50fe8c9d2e5e9d4db8f4fd5ff68ec949de9d4ca39756c311840523f208e7591d"
    },
    "Grants": [
        {
            "Grantee": {
                "DisplayName": "encode-aws",
                "ID": "a0dd0872acae5121b64b11c694371e606e28ab2e746e180ec64a2f85709eb0cd",
                "Type": "CanonicalUser"
            },
            "Permission": "FULL_CONTROL"
        },
        {
            "Grantee": {
                "Type": "Group",
                "URI": "http://acs.amazonaws.com/groups/global/AllUsers"
            },
            "Permission": "READ"
        }
    ]
}
```
If you get `403 Permission denied` then call-caching will not work.

To avoid all permission/region problems, please use non-S3 URIs/URLs.


## References

https://docs.opendata.aws/genomics-workflows/orchestration/cromwell/cromwell-overview.html


## Troubleshooting

See [this](TROUBLESHOOTING.md) for troubleshooting.
57 changes: 57 additions & 0 deletions scripts/aws_caper_server/TROUBLESHOOTING.md
@@ -0,0 +1,57 @@
## Troubleshooting

Run `caper debug WORKFLOW_ID` to debug/troubleshoot a workflow.


### `Could not read from s3...`

If you use private S3 URIs in an input JSON then you will see this error. Please don't use any private S3 URIs. Get a presigned HTTP URL of the private bucket file or use `~/.netrc` authentication instead.

```javascript
"failures": [
{
"causedBy": [
{
"causedBy": [
{
"message": "s3://s3.amazonaws.com/encode-processing/test_without_size_call/5826859d-d07c-4749-a2fe-802c6c6964a6/call-get_b/get_b-rc.txt",
"causedBy": []
}
],
"message": "Could not read from s3://encode-processing/test_without_size_call/5826859d-d07c-4749-a2fe-802c6c6964a6/call-get_b/get_b-rc.txt: s3://s3.amazonaws.com/encode-processing/test_without_size_call/5826859d-d07c-4749-a2fe-802c6c6964a6/call-get_b/get_b-rc.txt"
}
],
"message": "[Attempted 1 time(s)] - IOException: Could not read from s3://encode-processing/test_without_size_call/5826859d-d07c-4749-a2fe-802c6c6964a6/call-get_b/get_b-rc.txt: s3://s3.amazonaws.com/encode-processing/test_without_size_call/5826859d-d07c-4749-a2fe-802c6c6964a6/call-get_b/get_b-rc.txt"
}
],
```
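
If you control the private bucket, one possible workaround (a sketch; the bucket and key are placeholders) is to generate a presigned HTTP URL with the AWS CLI and put that URL in the input JSON instead of the S3 URI:
```bash
# Generates a URL that is valid for one hour (3600 seconds).
$ aws s3 presign s3://my-private-bucket/path/to/sample.fastq.gz --expires-in 3600
```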


### `S3Exception: null (Service: S3, Status Code: 301)`

If S3 URIs in an input JSON are in a different region, then you will see a `301` error. Please don't use S3 URIs outside of your region; it's better to use files in your own region or non-S3 URIs/URLs instead.

```javascript
"callCaching": {
"hashFailures": [
{
"causedBy": [
{
"message": "null (Service: S3, Status Code: 301, Request ID: null, Extended Request ID: MpqH6PrTGZwXu2x5pt8H38VWqnrpWWT7nzH/fZtbiEIKJkN9qrB2koEXlmXAYdvehvAfy5yQggE=)",
"causedBy": []
}
],
"message": "[Attempted 1 time(s)] - S3Exception: null (Service: S3, Status Code: 301, Request ID: null, Extended Request ID: MpqH6PrTGZwXu2x5pt8H38VWqnrpWWT7nzH/fZtbiEIKJkN9qrB2koEXlmXAYdvehvAfy5yQggE=)"
}
],
"allowResultReuse": false,
"hit": false,
"result": "Cache Miss",
"effectiveCallCachingMode": "CallCachingOff"
}
```


### `S3Exception: null (Service: S3, Status Code: 400)`

If you see a `400` error, please use the shell script `./create_instance.sh` to create a server instance instead of running the Caper server on your laptop/local machine.
31 changes: 7 additions & 24 deletions scripts/gcp_caper_server/README.md
@@ -4,7 +4,7 @@

> **NOTE**: Google Cloud Life Sciences API is a new API replacing the old, deprecated Genomics API (`v2alpha1`). It requires `--gcp-region` to be defined correctly. Check [supported regions](https://cloud.google.com/life-sciences/docs/concepts/locations) for the new API.
## Requirements
## Install Google Cloud SDK CLI

Make sure that `gcloud` (Google Cloud SDK CLI) is installed on your system.
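
A quick optional sanity check with standard `gcloud` commands:
```bash
$ gcloud version                    # confirms that the SDK is installed
$ gcloud auth list                  # shows the active account
$ gcloud config get-value project   # shows the currently configured project
```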

@@ -17,35 +17,13 @@ Go to [APIs & Services](https://console.cloud.google.com/apis/dashboard) on your
Go to [Service accounts](https://console.cloud.google.com/iam-admin/serviceaccounts) on your project and create a new service account with the following roles:
* Compute Admin
* Storage Admin: You can skip this and individually configure permission on each bucket on the project.
* Cloud Life Sciences Admin
* Cloud Life Sciences Admin (Cromwell's PAPI v2beta)
* **Service Account User** (VERY IMPORTANT).

Generate a secret key JSON from the service account and keep it locally on your computer.
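
You can generate the key from the web console, or with `gcloud` as sketched below (the key file path, service account name and project are placeholders):
```bash
$ gcloud iam service-accounts keys create ~/caper_service_account_key.json \
    --iam-account my-caper-sa@my-project.iam.gserviceaccount.com
```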

> **WARNING**: Such a secret JSON file is a master key for important resources on your project. Keep it secure at your own risk. This file will be used by Caper, so it will be transferred to the created instance at `/opt/caper/service_account_key.json`, visible to all users on the instance.
## Troubleshooting errors

If you see permission errors, check that the above roles are correctly configured for your service account.

If you see PAPI errors and Google's HTTP endpoint deprecation warning, remove the Life Sciences API role from your service account and add it back.

If you see the following error, then click on your service account under `Service Accounts` in `IAM` of your Google project and make sure that `Enable G Suite Domain-wide Delegation` is checked.
```
400 Bad Request
POST https://lifesciences.googleapis.com/v2beta/projects/99884963860/locations/us-central1/operations/XXXXXXXXXXXXXXXXXXXX:cancel
{
  "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "message" : "Precondition check failed.",
    "reason" : "failedPrecondition"
  } ],
  "message" : "Precondition check failed.",
  "status" : "FAILED_PRECONDITION"
}
```

## How to create an instance

Run without arguments to see detailed help. Some optional arguments are very important depending on your region/zone, e.g. `--gcp-region` (for provisioning worker instances via the Life Sciences API) and `--zone` (for server instance creation only). These regional parameters default to US central region/zones.
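
For illustration, the positional arguments are elided here (run the script without arguments to see them); a hypothetical invocation overriding the regional defaults might look like:
```bash
# Print detailed help, including the required positional arguments.
$ bash create_instance.sh
# Hypothetical example with the regional flags mentioned above (positional arguments elided).
$ bash create_instance.sh ... --gcp-region us-central1 --zone us-central1-a
```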
@@ -109,3 +87,8 @@ If users want to have their own configuration at `~/.caper/default.conf`, simply
$ rm ~/.caper/default.conf
$ cp /opt/caper/default.conf ~/.caper/default.conf
```


## Troubleshooting

See [this](TROUBLESHOOTING.md) for troubleshooting.
21 changes: 21 additions & 0 deletions scripts/gcp_caper_server/TROUBLESHOOTING.md
@@ -0,0 +1,21 @@
## Troubleshooting errors

If you see permission errors, check that the service account roles described in [README.md](README.md) are correctly configured for your service account.

If you see PAPI errors and Google's HTTP endpoint deprecation warning, remove the Life Sciences API role from your service account and add it back.

If you see the following error, then click on your service account under `Service Accounts` in `IAM` of your Google project and make sure that `Enable G Suite Domain-wide Delegation` is checked.
```
400 Bad Request
POST https://lifesciences.googleapis.com/v2beta/projects/99884963860/locations/us-central1/operations/XXXXXXXXXXXXXXXXXXXX:cancel
{
  "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "message" : "Precondition check failed.",
    "reason" : "failedPrecondition"
  } ],
  "message" : "Precondition check failed.",
  "status" : "FAILED_PRECONDITION"
}
```
