
first draft diirm doc #211

Open · wants to merge 3 commits into base `master`
116 changes: 116 additions & 0 deletions content/resources/user/diirm-submission.md
@@ -0,0 +1,116 @@
---
title: "Gen3 - DIIRM Submission"
date: 2018-09-12T20:53:20-05:00
linktitle: /resources/user
layout: withtoc
menuname: userMenu
---

> **Review comment** (on the title): I honestly don't like the name DIIRM (not sure how Bob feels about it either at this point). I think we should just call this Gen3 Data Ingestion.

{{% markdownwrapper %}}
# DIIRM Submission of Data Files
* * *

The following guide details the steps a data contributor must take to submit project data to a Gen3 data commons with the Data, Ingest, Index, Resource Management (DIIRM) system.

> **Review comment** (on the DIIRM expansion): Data indexing, ingestion and release management

The goal of DIIRM is to expedite the ingestion and hosting of data in cloud buckets through a series of streamlined, automated and systematic steps that adhere to strict QA tests. In order to support the many advantages of using Gen3’s standard tooling for DIIRM, data needs to first be organized and copied to cloud buckets following the guidelines detailed below.

* * *

## 1. Prepare Project with the Gen3 sdk tools

> **Suggested change:**
> From: `## 1. Prepare Project with the Gen3 sdk tools`
> To: `## 1. Prepare Project with the Gen3 SDK tools`

* * *
In order to submit data files, a Gen3 project must exist to associate the files with. The [Gen3 Submission SDK](https://uc-cdis.github.io/gen3sdk-python/_build/html/_modules/gen3/submission.html) provides a comprehensive set of tools that enable users to script the submission of projects.

> **Review comment:** This is not true in DIIRM (and something important to point out). You don't need the graph to use most of the SDK code for data ingestion.

> **Review comment:** This is important because some projects may only use our Framework Services and not have a full Gen3 Data Commons with a graph.


Sample code for submission of a project to a data commons (a hedged sketch using the Gen3 SDK's `Gen3Auth` and `Gen3Submission` classes; the endpoint, credentials file, and project fields are placeholders, and the exact `Gen3Auth` signature may vary between SDK versions):
```
from gen3.auth import Gen3Auth
from gen3.submission import Gen3Submission

# Placeholders -- substitute your commons endpoint and downloaded API key.
endpoint = "https://gen3.datacommons.io"
auth = Gen3Auth(refresh_file="credentials.json")
sub = Gen3Submission(endpoint, auth)

# Create a project record under an existing program.
project = {
    "type": "project",
    "code": "training",
    "dbgap_accession_number": "phs001416",
}
sub.create_project("example", project)
```

* * *
## 2. Upload files to Object Storage with Cloud Resource Command Line Interface
* * *

Data can be submitted to a separate cloud resource as long as requirements for access and authorization are met. In order to support the many advantages of using Gen3’s standard tooling for DIIRM, data needs to first be organized and copied to cloud buckets following the guidelines detailed below.

### Data and Access Considerations

The recommended (and simplest) way for Gen3 to provide controlled access to data is via signed URLs. Signed URLs are the only fully cloud-agnostic method supported by Gen3, and they are additionally supported by all major cloud resource providers. They also allow for short-lived, temporary access to the data, reducing risk. Lastly, utilizing signed URLs places very few restrictions on the organization of data within cloud bucket(s).

### Allocating Data in Buckets Based on User Access

Gen3 grants data access at the granularity of the project level only. Accordingly, the data in a particular bucket should be associated with only a single level of user access.

> **Review comment:** Technically this requirement for separate buckets is for Google only, but since it's best to have Google and AWS match, we've always pushed for separation in both. Might be worth noting here, since if it's all AWS data, it could live in a single bucket.


#### Bucket Allocation Example:
A user’s authorization may look something like:

A user has read access to phs001416.c1, phs001416.c2, phs000974.c2

The data in buckets could be separated by phsid+consent code combinations (as this is the smallest granularity of data access required).

> **Review comment:** Consent groups and phsids are dbGaP-specific constructs. We should clarify and/or try to use something more general; many people using Gen3 won't be using dbGaP.


The following bucket structure supports the ingestion of dbGaP’s MESA and FHS projects (from TOPMed). Each project has 2 distinct consent groups, and the data is mirrored on both AWS and Google buckets.

- TOPMed-MESA (phs001416.c1 and .c2)
- TOPMed-FHS (phs000974.c1 and .c2)

| Project | AWS buckets | Google buckets |
| --- | --- | --- |
| MESA (consent group 1) | s3://nih-nhlbi-topmed-released-phs001416-c1 | gs://nih-nhlbi-topmed-released-phs001416-c1 |
| MESA (consent group 2) | s3://nih-nhlbi-topmed-released-phs001416-c2 | gs://nih-nhlbi-topmed-released-phs001416-c2 |
| FHS (consent group 1) | s3://nih-nhlbi-topmed-released-phs000974-c1 | gs://nih-nhlbi-topmed-released-phs000974-c1 |
| FHS (consent group 2) | s3://nih-nhlbi-topmed-released-phs000974-c2 | gs://nih-nhlbi-topmed-released-phs000974-c2 |

With a setup similar to this, Gen3 is able to support signed URLs and fully configured end-user access.
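
The allocation scheme above reduces to a simple naming rule. A minimal sketch (the `bucket_names` helper and its `prefix` argument are illustrative, not part of the Gen3 SDK):

```python
def bucket_names(prefix, phsid, consent):
    """Derive mirrored AWS and Google bucket names for one
    phsid + consent-group combination (the smallest access unit)."""
    suffix = f"{phsid}-{consent}"
    return {
        "aws": f"s3://{prefix}-{suffix}",
        "google": f"gs://{prefix}-{suffix}",
    }

# One bucket pair per access unit, as in the TOPMed MESA example above.
print(bucket_names("nih-nhlbi-topmed-released", "phs001416", "c1")["aws"])
# -> s3://nih-nhlbi-topmed-released-phs001416-c1
```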


### Bucket Population

Once a data allocation scheme is determined, data can be uploaded accordingly to cloud buckets. It should be noted that while Amazon AWS and Google are the most supported cloud providers, Gen3 is cloud agnostic. Any method and hierarchy structure can be used for upload as long as the same parent directory is maintained with end-user access.

> **Suggested change:** replace "Gen3 is cloud agnostic" with "Gen3 has community-contributed code for Azure storage support as well".
* * *
## 3. Create Bucket Mapping File
* * *

At a minimum, Gen3 requires a bucket-to-acl or bucket-to-authz mapping file; acl and authz can be used interchangeably here. This file is needed to provide end users access to whole buckets based on their RAS-provided credentials.

> **Review comment:** I think we should have an opinion here and just require the use of authz.


Example of a bucket mapping file:
| bucket | authz |
| --- | --- |
| s3://nih-nhlbi-topmed-released-phs001416-c1 | phs001416.c1 |
| gs://nih-nhlbi-topmed-released-phs001416-c1 | phs001416.c1 |
| s3://nih-nhlbi-topmed-released-phs001416-c2 | phs001416.c2 |
| gs://nih-nhlbi-topmed-released-phs001416-c2 | phs001416.c2 |
| s3://nih-nhlbi-topmed-released-phs000974-c1 | phs000974.c1 |
| gs://nih-nhlbi-topmed-released-phs000974-c1 | phs000974.c1 |
| s3://nih-nhlbi-topmed-released-phs000974-c2 | phs000974.c2 |
| gs://nih-nhlbi-topmed-released-phs000974-c2 | phs000974.c2 |
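
Such a mapping can drive access checks directly: a user may read a bucket when their authz set contains the bucket's access unit. A minimal sketch (the `can_access` helper is illustrative; the mapping mirrors rows of the table above):

```python
# Bucket-to-authz mapping, mirroring rows of the table above.
bucket_authz = {
    "s3://nih-nhlbi-topmed-released-phs001416-c1": "phs001416.c1",
    "gs://nih-nhlbi-topmed-released-phs001416-c1": "phs001416.c1",
    "s3://nih-nhlbi-topmed-released-phs000974-c2": "phs000974.c2",
}

def can_access(user_authz, bucket):
    """True if the user's authz set covers the bucket's access unit."""
    return bucket_authz.get(bucket) in user_authz

user = {"phs001416.c1", "phs001416.c2", "phs000974.c2"}
print(can_access(user, "s3://nih-nhlbi-topmed-released-phs001416-c1"))  # -> True
```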

The preferred option is a file-level mapping file that contains additional columns (up to six).
| File_name | File_size | md5sum | bucket_urls | acl | authz |
| --- | --- | --- | --- | --- | --- |
| examplefile.txt | 123456 | sample_md5 | s3://nih-nhlbi-topmed-released-phs001416-c1/example-bucket/examplefile.txt gs://nih-nhlbi-topmed-released-phs001416-c1/example-bucket/examplefile.txt | [phs001416,c1] | [phs001416.c1] |
| otherexamplefile.txt | 123456 | different_md5 | s3://nih-nhlbi-topmed-released-phs001416-c1/example-bucket/otherexamplefile.txt gs://nih-nhlbi-topmed-released-phs001416-c1/example-bucket/otherexamplefile.txt | [phs001416,c1] | [phs001416.c1] |
| examplefile.txt | 123456 | sample_md5 | s3://nih-nhlbi-topmed-released-phs001416-c2/example-bucket/examplefile.txt gs://nih-nhlbi-topmed-released-phs001416-c2/example-bucket/examplefile.txt | [phs001416,c2] | [phs001416.c2] |
| otherexamplefile.txt | 123456 | different_md5 | s3://nih-nhlbi-topmed-released-phs001416-c2/example-bucket/otherexamplefile.txt gs://nih-nhlbi-topmed-released-phs001416-c2/example-bucket/otherexamplefile.txt | [phs001416,c2] | [phs001416.c2] |

Gen3 expects a bucket mapping file, which (at a minimum) must include the names of all the buckets and an indication as to which cloud they’re in. In the situation where Gen3 must support cloud-specific data access methods, Gen3 also requires the authz column (which should contain the granular access control which would represent access to the entire bucket).

> **Review comment:**
> > an indication as to which cloud they're in
>
> This can be inferred from the `s3://` and `gs://` prefixes, so as long as those are there, we're good. Might be good to clarify that, as the current wording seems to imply the example is missing another column.
>
> > In the situation where Gen3 must support cloud-specific data access methods, Gen3 also requires the authz column (which should contain the granular access control which would represent access to the entire bucket).
>
> This is somewhat confusingly worded. I think what this really means is: if you want the bucket to require controlled access, put the required authz in there.




Creation of a file indexing manifest (a minimal, illustrative stdlib sketch; the file name, size, and checksum values come from the example tables above):
```
import csv

# Columns follow the file-level mapping format described above.
columns = ["File_name", "File_size", "md5sum", "bucket_urls", "acl", "authz"]
rows = [
    {
        "File_name": "examplefile.txt",
        "File_size": 123456,
        "md5sum": "sample_md5",
        "bucket_urls": "s3://nih-nhlbi-topmed-released-phs001416-c1/example-bucket/examplefile.txt",
        "acl": "[phs001416,c1]",
        "authz": "[phs001416.c1]",
    },
]

# Write the manifest as a tab-separated file.
with open("indexing_manifest.tsv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=columns, delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)
```

* * *
## 4. Submit file Indexing Manifest to Indexd
* * *



Gen3 offers an [Indexing SDK toolkit](https://uc-cdis.github.io/gen3sdk-python/_build/html/tools/indexing.html) to build, validate, and map all files into a Gen3 data commons.

This file should offer meta data as well as bucket mapping.

> **Suggested change:** replace "meta data" with "metadata".


| File_name | File_size | md5sum | bucket_urls | acl |
| --- | --- | --- | --- | --- |
| examplefile.txt | 123456 | sample_md5 | s3://example-bucket/examplefile.txt gs://example-bucket/examplefile.txt | [phs000001,c1] |

> **Review comment:** suggested header instead: `guid | size | md5 | urls | acl | authz`
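
Before submitting, the manifest can be sanity-checked locally. A minimal stdlib sketch, assuming the `guid`/`size`/`md5`/`urls`/`acl`/`authz` header suggested in the review comment above (the GUID and the specific validation rules are illustrative):

```python
import csv
import io
import re

REQUIRED = ["guid", "size", "md5", "urls", "acl", "authz"]

def validate_manifest(tsv_text):
    """Return a list of problems found in a tab-separated indexing manifest."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    for i, row in enumerate(reader, start=2):  # line 1 is the header
        if not row["size"].isdigit():
            problems.append(f"line {i}: size is not an integer")
        if not re.fullmatch(r"[0-9a-f]{32}", row["md5"]):
            problems.append(f"line {i}: md5 is not a 32-character hex digest")
        if not any(u.startswith(("s3://", "gs://")) for u in row["urls"].split()):
            problems.append(f"line {i}: no s3:// or gs:// URL")
    return problems

# Hypothetical GUID plus the example file from the tables above.
manifest = (
    "guid\tsize\tmd5\turls\tacl\tauthz\n"
    "example-guid-0001\t123456\t" + "a" * 32
    + "\ts3://example-bucket/examplefile.txt\t[phs000001,c1]\t[phs000001.c1]\n"
)
print(validate_manifest(manifest))  # -> []
```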

* * *
To continue your data submission, return to the main [Gen3 - Data Contribution](https://gen3.org/resources/user/submit-data/#4-submit-additional-project-metadata) page.
116 changes: 116 additions & 0 deletions content/resources/user/gui-submission.md
@@ -0,0 +1,116 @@
---
title: "Gen3 - GUI Submission"
date: 2018-09-12T20:53:20-05:00
linktitle: /resources/user
layout: withtoc
menuname: userMenu
---
{{% markdownwrapper %}}
# GUI Submission of Data Files
* * *

The following guide details the steps a data contributor must take to submit a project to a Gen3 data commons with the Graphical User Interface (GUI).

* * *

## 1. Prepare Project in Submission Portal
* * *
<!--
This section could be removed if we (semi)automate CMC creation for users
-->

In order to upload data files, at least one record in the `core_metadata_collection` node must exist. If your project already has at least one record in this node, you can skip to step 2 below.

Do the following to create your first `core_metadata_collection` record:

1. Go to your data commons' submission portal website
2. Click on 'Submit Data'
3. Find your project in the list of Projects and click 'Submit Data'
4. Click 'Use Form Submission' and choose `core_metadata_collection` from the dropdown list (or [edit and upload this TSV](gen3_core_metadata_collection_template.tsv) by clicking 'Upload File' then 'Submit')

![node_dropdown.png](node_dropdown.png)

![cmc_form.png](cmc_form.png)


5. Fill in the required information (see note below)
6. Click 'Upload submission json from form' and then 'Submit'
7. Make note of the `submitter_id` of your `core_metadata_collection` record for step 3 below

>__Note:__ Minimally, `submitter_id` and `projects.code` are required properties. The project `code` is the name of your project without the "program-" prefix. For example, if your project URL is https://gen3.datacommons.io/example-training, your project's `code` would be 'training', the `program` would be 'example', and your `project_id` would be the combination: 'example-training'.
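
The naming rule in the note can be expressed as a short, illustrative helper (the URL is the hypothetical one from the note; `split_project_id` is not part of any Gen3 tool):

```python
from urllib.parse import urlparse

def split_project_id(project_url):
    """Split the last path segment of a project URL into (program, code);
    the project_id is '<program>-<code>'."""
    project_id = urlparse(project_url).path.strip("/").split("/")[-1]
    program, code = project_id.split("-", 1)
    return program, code

print(split_project_id("https://gen3.datacommons.io/example-training"))
# -> ('example', 'training')
```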

You should have received the message:

```
succeeded: 200
Successfully created entities: 1 of core_metadata_collection
```

If you received any other message, then check the 'Details' to help determine the error.

To view the records in the `core_metadata_collection` node in your project, you can go to:
https://gen3.datacommons.io/example-training/search?node_type=core_metadata_collection
(replacing the first part of that URL with the URL of your actual project).

## 2. Upload Data Files to Object Storage
* * *

Data files such as spreadsheets, sequencing data (BAM, FASTQ), assay results, images, PDFs, etc., are uploaded to object storage with the [gen3-client command-line tool](/resources/user/gen3-client).

>__Note:__ if your data files are already located in cloud storage, such as an AWS or GCS bucket, please see [this page](/resources/user/submit-data/sower) on how to make these files available in a Gen3 data commons.

> **Review comment:** OR if you need to support Google.
>
> I think we need a larger disclaimer that this existing "Upload Data Files" method is very limited in scalability and doesn't work with Google. IMO we should push people to use the other method entirely, because ideally (once there's better tooling and docs) we should remove the old method to clean up our stack.


1. Download the latest [compiled binary](https://github.com/uc-cdis/cdis-data-client/releases/latest) for your operating system.
2. Configure a profile with credentials downloaded from your Profile:

```
./gen3-client configure --profile=<profile_name> --cred=<credentials.json> --apiendpoint=<api_endpoint_url>

```
3. Upload Files: single data file, a directory of files, or matching files:

```
./gen3-client upload --profile=<profile_name> --upload-path=~/files/example.txt

./gen3-client upload --profile=<profile_name> --upload-path=~/files/

./gen3-client upload --profile=<profile_name> --upload-path=~/files/*.txt

```

For detailed instructions on configuring and using the gen3-client, visit the [Gen3 client documentation](/resources/user/gen3-client).

## 3. Map Uploaded Files to a Data File Node
* * *

Once data files are successfully uploaded, the files must be mapped to the appropriate node in the data model before they're accessible to authorized users.

1. Go to your data commons submission portal website.

2. Click 'Submit Data'.

![submit-data.png](submit-data.png)

3. Click the 'Map My Files' button.

![map-my-files.png](map-my-files.png)

4. Select the files to map using the checkboxes and click the 'Map Files' button.

![select-files.png](select-files.png)

5. Select the project and node that the files belong to.

![map-to-node.png](map-to-node.png)

6. Fill in the values of any required properties and click the 'Submit' button.

![fill-required-properties.png](fill-required-properties.png)


> __Note:__ The required property 'Type' in step 6 is the node's name (the 'type' of node) and should be the same as the value selected from the node dropdown list in step 5.

You should receive the message "# files mapped successfully!" upon success.

* * *

To continue your data submission, return to the main [Gen3 - Data Contribution](/resources/user/submit-data/#4-submit-additional-project-metadata) page.
@@ -1 +1 @@
type project_id submitter_id projects.code contributor coverage creator data_type date description format language publisher relation rights source subject title
core_metadata_collection example-training collection-01 training