Merge pull request #98 from ENCODE-DCC/PIP-1432_auto_write_metadata
Pip 1432 auto write metadata
leepc12 authored Nov 4, 2020
2 parents 5799d0d + e0eba61 commit 18b1f27
Showing 12 changed files with 72 additions and 119 deletions.
2 changes: 1 addition & 1 deletion DETAILS.md
@@ -155,7 +155,7 @@ We highly recommend to use a default configuration file described in the section
--file-db, -d|File-based metadata DB for Cromwell's built-in HyperSQL database (UNSTABLE)
--db-timeout|Milliseconds to wait for DB connection (default: 30000)
--java-heap-server|Java heap memory for caper server (default: 10G)
--disable-auto-update-metadata| Disable auto update/retrieval/writing of `metadata.json` on workflow's output directory.
--disable-auto-write-metadata| Disable auto update/retrieval/writing of `metadata.json` on workflow's output directory.
--java-heap-run|Java heap memory for caper run (default: 3G)
--show-subworkflow|Include subworkflow in `caper list` search query. **WARNING**: If there are too many subworkflows, then you will see an HTTP 503 error (service unavailable) or the Caper/Cromwell server can crash.

66 changes: 0 additions & 66 deletions README.md
@@ -1,71 +1,5 @@
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black) [![CircleCI](https://circleci.com/gh/ENCODE-DCC/caper.svg?style=svg)](https://circleci.com/gh/ENCODE-DCC/caper)

# Major changes for Caper 1.0.

If you are upgrading Caper from previous versions:
- Edit your `~/.caper/default.conf` to remove `cromwell=` and `womtool=` from it; Caper will then automatically download Cromwell/Womtool version 51, which supports the new Google Cloud Life Sciences API (v2beta). You can also use `caper init [YOUR_BACKEND]` to locally install the Cromwell/Womtool JARs.

> **CRITICAL**: Due to the Cromwell upgrade in Caper 1.0 (`47` to `51`), a metadata database (`--db`) generated before 1.0 will not work with >= 1.0. See details below.
Upgraded Cromwell from 47 to 51.
- Metadata DB generated with Caper<1.0 will not work with Caper>=1.0.
- See [this note](https://github.com/broadinstitute/cromwell/releases/tag/49) for DB migration instructions.
- We recommend using Cromwell-51 with Caper>=1.0 since it has been fully tested with Cromwell-51.

Changed hashing strategy for all local backends (`local`, `slurm`, `sge`, `pbs`).
- Default hashing strategy changed from `file` (based on md5sum, which is expensive) to `path+modtime`.
- Changing the hashing strategy while reusing the same metadata DB will result in cache misses.

Changed duplication strategy for all local backends (`local`, `slurm`, `sge`, `pbs`).
- Default file duplication strategy changed from `hard-link` to `soft-link`.
- This is for filesystems (e.g. beeGFS) that do not allow hard-linking.
- Caper<1.0 hard-linked input files even with `--soft-glob-output`.
- For Caper>=1.0, you still need to use `--soft-glob-output` for such filesystems.

Google Cloud Platform backend (`gcp`):
- Can use a service account instead of application default credentials (end user's auth).
- Added `--gcp-service-account-key-json`.
- Make sure that the service account has sufficient permissions (roles) on resources in the Google Cloud Platform project (`--gcp-prj`). See [details](docs/conf_gcp.md#how-to-run-caper-with-a-service-account).
- Can use Google Cloud Life Sciences API (v2beta) instead of the deprecated Google Cloud Genomics API (v2alpha1).
- Added `--use-google-cloud-life-sciences`.
- For `caper server/run`, you need to specify a region `--gcp-region` to use Life Sciences API. Check [supported regions](https://cloud.google.com/life-sciences/docs/concepts/locations). `--gcp-zones` will be ignored.
- Make sure to enable `Google Cloud Life Sciences API` on Google Cloud Platform console (APIs & Services -> `+` button on top).
- Also, if you use a service account, add the `Life Sciences Admin` role to it.
- We will deprecate old `Genomics API` support. `Life Sciences API` will become the new default after the next 2-3 releases.
- Added [`memory-retry`](https://cromwell.readthedocs.io/en/stable/backends/Google/) to Caper. This is for `gcp` backend only.
- Retries (controlled by `--max-retries`) on an instance with increased memory if a workflow fails due to an OOM (out-of-memory) error.
- Comma-separated keys to catch OOM errors: `--gcp-prj-memory-retry-error-keys`.
- Memory multiplier applied on each retry due to OOM: `--gcp-prj-memory-retry-multiplier`.

Changed parameter names (backward compatible):
- `--out-dir` -> `--local-out-dir`
- `--out-gcs-bucket` -> `--gcp-out-dir`
- `--out-s3-bucket` -> `--aws-out-dir`
- `--tmp-dir` -> `--local-loc-dir`
- `--tmp-gcs-bucket` -> `--gcp-loc-dir`
- `--tmp-s3-bucket` -> `--aws-loc-dir`

Added parameters
- `--use-google-cloud-life-sciences` and `--gcp-region`: Use Life Sciences API (Cromwell's v2beta scheme).
- `--gcp-service-account-key-json`: Use a service account for auth on GCP (instead of application default).
- `--gcp-prj-memory-retry-error-keys`: Comma-separated keys to catch OOM errors on GCP.
- `--gcp-prj-memory-retry-multiplier`: Memory multiplier applied on each retry due to an OOM error on GCP.
- `--cromwell-stdout`: Redirect Cromwell STDOUT to file.

Improved Python interface.
- Caper<1.0 was designed primarily for the CLI.
- Caper>=1.0 is designed Python-interface-first; the CLI is built on top of that interface.
- Can retrieve `metadata.json` with subworkflows' metadata JSON embedded.

Better logging and troubleshooting.
- Writes Cromwell STDOUT to `cromwell.out` by default (controlled by `--cromwell-stdout`).


> **IMPORTANT**: `--use-gsutil-for-s3` requires `gsutil` >= 4.47 installed on your system. This flag allows a direct transfer between `gs://` and `s3://`. See this [issue](https://github.com/GoogleCloudPlatform/gsutil/issues/935) for details. `gsutil` is based on Python 2.
```bash
$ pip install gsutil --upgrade
```

# Caper

Caper (Cromwell Assisted Pipeline ExecutoR) is a wrapper Python package for [Cromwell](https://github.com/broadinstitute/cromwell/).
2 changes: 1 addition & 1 deletion caper/__init__.py
@@ -2,4 +2,4 @@
from .caper_runner import CaperRunner

__all__ = ['CaperClient', 'CaperClientSubmit', 'CaperRunner']
__version__ = '1.4.1'
__version__ = '1.4.2'
7 changes: 7 additions & 0 deletions caper/backward_compatibility.py
@@ -10,3 +10,10 @@
'tmp_s3_bucket': 'aws_loc_dir',
'ip': 'hostname',
}

CAPER_1_4_2_PARAM_KEY_NAME_CHANGE = {'auto_update_metadata': 'auto_write_metadata'}

PARAM_KEY_NAME_CHANGE = {
**CAPER_1_0_0_PARAM_KEY_NAME_CHANGE,
**CAPER_1_4_2_PARAM_KEY_NAME_CHANGE,
}
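
The merged `PARAM_KEY_NAME_CHANGE` map above is what keeps configuration files written for older Caper releases working after the rename. A minimal sketch of the effect of such a key map on a legacy conf dict — `remap_conf_keys` is a hypothetical helper for illustration, not Caper's actual `update_parsers_defaults_with_conf`:

```python
# Illustrative sketch only: shows what remapping legacy conf keys amounts to.
# The key pairs below come from backward_compatibility.py; the helper and the
# sample conf values are made up.
PARAM_KEY_NAME_CHANGE = {
    'ip': 'hostname',                               # Caper 1.0.0 rename (one of several)
    'auto_update_metadata': 'auto_write_metadata',  # Caper 1.4.2 rename
}

def remap_conf_keys(conf, key_map):
    """Return a copy of conf with any old-style keys renamed to new ones."""
    return {key_map.get(key, key): value for key, value in conf.items()}

old_conf = {'ip': 'localhost', 'auto_update_metadata': False}
print(remap_conf_keys(old_conf, PARAM_KEY_NAME_CHANGE))
# -> {'hostname': 'localhost', 'auto_write_metadata': False}
```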
8 changes: 3 additions & 5 deletions caper/caper_args.py
@@ -5,7 +5,7 @@
from autouri import URIBase

from .arg_tool import update_parsers_defaults_with_conf
from .backward_compatibility import CAPER_1_0_0_PARAM_KEY_NAME_CHANGE
from .backward_compatibility import PARAM_KEY_NAME_CHANGE
from .caper_workflow_opts import CaperWorkflowOpts
from .cromwell import Cromwell
from .cromwell_backend import (
@@ -533,7 +533,7 @@ def get_parser_and_defaults(conf_file=None):
help='Cromwell Java heap size for "server" mode (java -Xmx)',
)
parent_server.add_argument(
'--disable-auto-update-metadata',
'--disable-auto-write-metadata',
action='store_true',
help='Disable automatic retrieval/update/writing of metadata.json upon workflow/task status change.',
)
@@ -859,9 +859,7 @@ def get_parser_and_defaults(conf_file=None):
]
if os.path.exists(conf_file):
conf_dict = update_parsers_defaults_with_conf(
parsers=subparsers,
conf_file=conf_file,
conf_key_map=CAPER_1_0_0_PARAM_KEY_NAME_CHANGE,
parsers=subparsers, conf_file=conf_file, conf_key_map=PARAM_KEY_NAME_CHANGE
)
else:
conf_dict = None
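
For context, the renamed option stays a plain `store_true` flag; `cli.py` (below) then inverts it into the positive `auto_write_metadata` keyword. A self-contained sketch of that pattern, assuming nothing about the rest of Caper's parser setup:

```python
import argparse

# Minimal sketch: only the option name, action and the negation mirror the
# diff; the surrounding parser is made up for illustration.
parser = argparse.ArgumentParser(prog='caper server')
parser.add_argument(
    '--disable-auto-write-metadata',
    action='store_true',
    help='Disable automatic retrieval/update/writing of metadata.json '
    'upon workflow/task status change.',
)

args = parser.parse_args(['--disable-auto-write-metadata'])
# cli.py forwards the flag as 'auto_write_metadata': not args.disable_auto_write_metadata
auto_write_metadata = not args.disable_auto_write_metadata
print(auto_write_metadata)  # False
```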
6 changes: 3 additions & 3 deletions caper/caper_runner.py
@@ -451,7 +451,7 @@ def server(
fileobj_stdout=None,
embed_subworkflow=False,
java_heap_server=Cromwell.DEFAULT_JAVA_HEAP_CROMWELL_SERVER,
auto_update_metadata=True,
auto_write_metadata=True,
work_dir=None,
dry_run=False,
):
@@ -486,7 +486,7 @@
This is to mimic behavior of Cromwell run mode's -m parameter.
java_heap_server:
Java heap (java -Xmx) for Cromwell server mode.
auto_update_metadata:
auto_write_metadata:
Automatic retrieval/writing of metadata.json upon workflow/task's status change.
work_dir:
Local temporary directory to store all temporary files.
@@ -518,7 +518,7 @@
fileobj_stdout=fileobj_stdout,
embed_subworkflow=embed_subworkflow,
java_heap_cromwell_server=java_heap_server,
auto_update_metadata=auto_update_metadata,
auto_write_metadata=auto_write_metadata,
dry_run=dry_run,
)
return th
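
Since the Python interface is first-class, the same switch is available as a keyword argument on `CaperRunner.server()`. A hedged sketch — `CaperRunner`'s constructor arguments are not part of this diff, so `runner` is assumed to be an already-configured instance:

```python
from caper import CaperRunner  # exported in caper/__init__.py


def start_server_without_metadata_writing(runner: CaperRunner):
    """Start a Caper server with automatic metadata.json writing turned off.

    auto_write_metadata defaults to True (see the signature above); passing
    False mirrors the --disable-auto-write-metadata CLI flag. The return
    value is whatever server() returns (`th` in the diff above).
    """
    return runner.server(auto_write_metadata=False)
```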
2 changes: 1 addition & 1 deletion caper/cli.py
@@ -317,7 +317,7 @@ def subcmd_server(caper_runner, args, nonblocking=False):
'server_heartbeat': sh,
'custom_backend_conf': get_abspath(args.backend_file),
'embed_subworkflow': True,
'auto_update_metadata': not args.disable_auto_update_metadata,
'auto_write_metadata': not args.disable_auto_write_metadata,
'java_heap_server': args.java_heap_server,
'dry_run': args.dry_run,
}
6 changes: 3 additions & 3 deletions caper/cromwell.py
@@ -326,7 +326,7 @@ def server(
fileobj_stdout=None,
embed_subworkflow=False,
java_heap_cromwell_server=DEFAULT_JAVA_HEAP_CROMWELL_SERVER,
auto_update_metadata=True,
auto_write_metadata=True,
on_server_start=None,
on_status_change=None,
cwd=None,
@@ -365,7 +365,7 @@ def server(
This is to mimic behavior of Cromwell run mode's -m parameter.
java_heap_cromwell_server:
Java heap (java -Xmx) for Cromwell server mode.
auto_update_metadata:
auto_write_metadata:
Automatic retrieval/writing of metadata.json upon workflow/task's status change.
on_server_start:
On server start.
@@ -429,7 +429,7 @@
server_port=server_port,
is_server=True,
embed_subworkflow=embed_subworkflow,
auto_update_metadata=auto_update_metadata,
auto_write_metadata=auto_write_metadata,
on_server_start=on_server_start,
on_status_change=on_status_change,
)
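
What `auto_write_metadata` ultimately controls — per the docstring above — is writing `metadata.json` whenever a workflow or task changes status. A purely hypothetical illustration of that step (Caper's real hook signature and metadata retrieval are not part of this diff):

```python
import json
from pathlib import Path


def write_metadata_on_status_change(metadata: dict, output_dir: str) -> Path:
    """Hypothetical: dump a workflow's metadata dict to <output_dir>/metadata.json.

    `metadata` is assumed to be the dict served by Cromwell's metadata
    endpoint and `output_dir` the workflow's output root; both names are
    assumptions for this sketch.
    """
    metadata_file = Path(output_dir) / 'metadata.json'
    metadata_file.parent.mkdir(parents=True, exist_ok=True)
    metadata_file.write_text(json.dumps(metadata, indent=4))
    return metadata_file
```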