
first draft diirm doc #211

Open · wants to merge 3 commits into base `master`
116 changes: 116 additions & 0 deletions content/resources/user/diirm-submission.md
@@ -0,0 +1,116 @@
---
title: "Gen3 - DIIRM Submission"
date: 2018-09-12T20:53:20-05:00
linktitle: /resources/user
layout: withtoc
menuname: userMenu
---

> **Review comment** (on the title): I honestly don't like the name DIIRM (not sure how Bob feels about it either at this point). I think we should just call this Gen3 Data Ingestion.

{{% markdownwrapper %}}
# DIIRM Submission of Data Files
* * *

The following guide details the steps a data contributor must take to submit project data to a Gen3 data commons with the Data, Ingest, Index, Resource Management (DIIRM) system.

> **Review comment** (on the DIIRM expansion): Data indexing, ingestion and release management

The goal of DIIRM is to expedite the ingestion and hosting of data in cloud buckets through a series of streamlined, automated and systematic steps that adhere to strict QA tests. In order to support the many advantages of using Gen3’s standard tooling for DIIRM, data needs to first be organized and copied to cloud buckets following the guidelines detailed below.

* * *

## 1. Prepare Project with the Gen3 sdk tools

> **Suggested change:**
> From: `## 1. Prepare Project with the Gen3 sdk tools`
> To: `## 1. Prepare Project with the Gen3 SDK tools`

* * *
In order to submit data files, a Gen3 project must exist to associate the files with. The [Gen3 Submission SDK](https://uc-cdis.github.io/gen3sdk-python/_build/html/_modules/gen3/submission.html) provides a comprehensive set of tools that enable users to script the submission of projects.

> **Review comment:** This is not true in DIIRM (and something important to point out). You don't need the graph to use most of the SDK code for data ingestion.

> **Review comment:** This is important because some projects may only use our Framework Services and not have a full Gen3 Data Commons with a graph.


Sample code for submission of a project to a data commons (a hedged sketch using the Gen3 SDK's `Gen3Auth` and `Gen3Submission` classes; the endpoint, credentials file, and project fields are placeholders, and the exact `Gen3Auth` signature may vary between SDK versions):
```
from gen3.auth import Gen3Auth
from gen3.submission import Gen3Submission

# Placeholders -- substitute your commons endpoint and downloaded API key.
endpoint = "https://gen3.datacommons.io"
auth = Gen3Auth(refresh_file="credentials.json")
sub = Gen3Submission(endpoint, auth)

# Create a project record under an existing program.
project = {
    "type": "project",
    "code": "training",
    "dbgap_accession_number": "phs001416",
}
sub.create_project("example", project)
```

* * *
## 2. Upload files to Object Storage with Cloud Resource Command Line Interface
* * *

Data can be submitted to a separate cloud resource as long as requirements for access and authorization are met. In order to support the many advantages of using Gen3’s standard tooling for DIIRM, data needs to first be organized and copied to cloud buckets following the guidelines detailed below.

### Data and Access Considerations

The recommended (and simplest) way for Gen3 to provide controlled access to data is via signed URLs. Signed URLs are the only fully cloud-agnostic method supported by Gen3, and they are additionally supported by all major cloud resource providers. They also allow for short-lived, temporary access to the data, reducing risk. Lastly, utilizing signed URLs places very few restrictions on the organization of data within cloud bucket(s).

### Allocating Data in Buckets Based on User Access

Gen3 grants data access at the granularity of the project level only. Accordingly, the data in a particular bucket should be associated with only a single level of user access.

> **Review comment:** Technically this requirement for separate buckets is for Google only, but since it's best to have Google and AWS match, we've always pushed for separation in both. Might be worth noting here, since if it's all AWS data, it could live in a single bucket.


#### Bucket Allocation Example:
A user’s authorization may look something like:

A user has read access to phs001416.c1, phs001416.c2, phs000974.c2

The data in buckets could be separated by phsid+consent code combinations (as this is the smallest granularity of data access required).

> **Review comment:** Consent groups and phsids are dbGaP-specific constructs. We should clarify and/or try to use something more general; many people using Gen3 won't be using dbGaP.


The following bucket structure supports the ingestion of dbGaP’s MESA and FHS projects (from TOPMed). Each project has 2 distinct consent groups, and the data is mirrored on both AWS and Google buckets.

- TOPMed-MESA (phs001416.c1 and .c2)
- TOPMed-FHS (phs000974.c1 and .c2)

| Project | AWS buckets | Google buckets |
| --- | --- | --- |
| MESA (consent group 1) | s3://nih-nhlbi-topmed-released-phs001416-c1 | gs://nih-nhlbi-topmed-released-phs001416-c1 |
| MESA (consent group 2) | s3://nih-nhlbi-topmed-released-phs001416-c2 | gs://nih-nhlbi-topmed-released-phs001416-c2 |
| FHS (consent group 1) | s3://nih-nhlbi-topmed-released-phs000974-c1 | gs://nih-nhlbi-topmed-released-phs000974-c1 |
| FHS (consent group 2) | s3://nih-nhlbi-topmed-released-phs000974-c2 | gs://nih-nhlbi-topmed-released-phs000974-c2 |

With a setup similar to this, Gen3 is able to support signed URLs and fully configured end-user access.
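
The allocation scheme above reduces to a simple naming rule. A minimal sketch (the `bucket_names` helper and its `prefix` argument are illustrative, not part of the Gen3 SDK):

```python
def bucket_names(prefix, phsid, consent):
    """Derive mirrored AWS and Google bucket names for one
    phsid + consent-group combination (the smallest access unit)."""
    suffix = f"{phsid}-{consent}"
    return {
        "aws": f"s3://{prefix}-{suffix}",
        "google": f"gs://{prefix}-{suffix}",
    }

# One bucket pair per access unit, as in the TOPMed MESA example above.
print(bucket_names("nih-nhlbi-topmed-released", "phs001416", "c1")["aws"])
# -> s3://nih-nhlbi-topmed-released-phs001416-c1
```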


### Bucket Population

Once a data allocation scheme is determined, data can be uploaded accordingly to cloud buckets. It should be noted that while Amazon AWS and Google are the most supported cloud providers, Gen3 is cloud agnostic. Any method and hierarchy structure can be used for upload as long as the same parent directory is maintained with end-user access.

> **Suggested change:** replace "Gen3 is cloud agnostic" with "Gen3 has community-contributed code for Azure storage support as well".
* * *
## 3. Create Bucket Mapping File
* * *

At a minimum, Gen3 requires a bucket-to-acl or bucket-to-authz mapping file; acl and authz can be used interchangeably here. This file is needed to provide end users access to whole buckets based on their RAS-provided credentials.

> **Review comment:** I think we should have an opinion here and just require the use of authz.


Example of a bucket mapping file:
| bucket | authz |
| --- | --- |
| s3://nih-nhlbi-topmed-released-phs001416-c1 | phs001416.c1 |
| gs://nih-nhlbi-topmed-released-phs001416-c1 | phs001416.c1 |
| s3://nih-nhlbi-topmed-released-phs001416-c2 | phs001416.c2 |
| gs://nih-nhlbi-topmed-released-phs001416-c2 | phs001416.c2 |
| s3://nih-nhlbi-topmed-released-phs000974-c1 | phs000974.c1 |
| gs://nih-nhlbi-topmed-released-phs000974-c1 | phs000974.c1 |
| s3://nih-nhlbi-topmed-released-phs000974-c2 | phs000974.c2 |
| gs://nih-nhlbi-topmed-released-phs000974-c2 | phs000974.c2 |
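
Such a mapping can drive access checks directly: a user may read a bucket when their authz set contains the bucket's access unit. A minimal sketch (the `can_access` helper is illustrative; the mapping mirrors rows of the table above):

```python
# Bucket-to-authz mapping, mirroring rows of the table above.
bucket_authz = {
    "s3://nih-nhlbi-topmed-released-phs001416-c1": "phs001416.c1",
    "gs://nih-nhlbi-topmed-released-phs001416-c1": "phs001416.c1",
    "s3://nih-nhlbi-topmed-released-phs000974-c2": "phs000974.c2",
}

def can_access(user_authz, bucket):
    """True if the user's authz set covers the bucket's access unit."""
    return bucket_authz.get(bucket) in user_authz

user = {"phs001416.c1", "phs001416.c2", "phs000974.c2"}
print(can_access(user, "s3://nih-nhlbi-topmed-released-phs001416-c1"))  # -> True
```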

The preferred option is a file-level mapping file that contains additional columns (up to six).
| File_name | File_size | md5sum | bucket_urls | acl | authz |
| --- | --- | --- | --- | --- | --- |
| examplefile.txt | 123456 | sample_md5 | s3://nih-nhlbi-topmed-released-phs001416-c1/example-bucket/examplefile.txt gs://nih-nhlbi-topmed-released-phs001416-c1/example-bucket/examplefile.txt | [phs001416,c1] | [phs001416.c1] |
| otherexamplefile.txt | 123456 | different_md5 | s3://nih-nhlbi-topmed-released-phs001416-c1/example-bucket/otherexamplefile.txt gs://nih-nhlbi-topmed-released-phs001416-c1/example-bucket/otherexamplefile.txt | [phs001416,c1] | [phs001416.c1] |
| examplefile.txt | 123456 | sample_md5 | s3://nih-nhlbi-topmed-released-phs001416-c2/example-bucket/examplefile.txt gs://nih-nhlbi-topmed-released-phs001416-c2/example-bucket/examplefile.txt | [phs001416,c2] | [phs001416.c2] |
| otherexamplefile.txt | 123456 | different_md5 | s3://nih-nhlbi-topmed-released-phs001416-c2/example-bucket/otherexamplefile.txt gs://nih-nhlbi-topmed-released-phs001416-c2/example-bucket/otherexamplefile.txt | [phs001416,c2] | [phs001416.c2] |

Gen3 expects a bucket mapping file, which (at a minimum) must include the names of all the buckets and an indication as to which cloud they’re in. In the situation where Gen3 must support cloud-specific data access methods, Gen3 also requires the authz column (which should contain the granular access control which would represent access to the entire bucket).

> **Review comment:**
> > an indication as to which cloud they're in
>
> This can be inferred from the `s3://` and `gs://` prefixes, so as long as those are there, we're good. Might be good to clarify that, as the current wording seems to imply the example is missing another column.
>
> > In the situation where Gen3 must support cloud-specific data access methods, Gen3 also requires the authz column (which should contain the granular access control which would represent access to the entire bucket).
>
> This is somewhat confusingly worded. I think what this really means is: if you want the bucket to require controlled access, put the required authz in there.




Creation of a file indexing manifest (a minimal, illustrative stdlib sketch; the file name, size, and checksum values come from the example tables above):
```
import csv

# Columns follow the file-level mapping format described above.
columns = ["File_name", "File_size", "md5sum", "bucket_urls", "acl", "authz"]
rows = [
    {
        "File_name": "examplefile.txt",
        "File_size": 123456,
        "md5sum": "sample_md5",
        "bucket_urls": "s3://nih-nhlbi-topmed-released-phs001416-c1/example-bucket/examplefile.txt",
        "acl": "[phs001416,c1]",
        "authz": "[phs001416.c1]",
    },
]

# Write the manifest as a tab-separated file.
with open("indexing_manifest.tsv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=columns, delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)
```

* * *
## 4. Submit file Indexing Manifest to Indexd
* * *



Gen3 offers an [Indexing SDK toolkit](https://uc-cdis.github.io/gen3sdk-python/_build/html/tools/indexing.html) to build, validate, and map all files into a Gen3 data commons.

This file should offer meta data as well as bucket mapping.

> **Suggested change:** replace "meta data" with "metadata".


| File_name | File_size | md5sum | bucket_urls | acl |
| --- | --- | --- | --- | --- |
| examplefile.txt | 123456 | sample_md5 | s3://example-bucket/examplefile.txt gs://example-bucket/examplefile.txt | [phs000001,c1] |

> **Review comment:** suggested header instead: `guid | size | md5 | urls | acl | authz`
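
Before submitting, the manifest can be sanity-checked locally. A minimal stdlib sketch, assuming the `guid`/`size`/`md5`/`urls`/`acl`/`authz` header suggested in the review comment above (the GUID and the specific validation rules are illustrative):

```python
import csv
import io
import re

REQUIRED = ["guid", "size", "md5", "urls", "acl", "authz"]

def validate_manifest(tsv_text):
    """Return a list of problems found in a tab-separated indexing manifest."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    for i, row in enumerate(reader, start=2):  # line 1 is the header
        if not row["size"].isdigit():
            problems.append(f"line {i}: size is not an integer")
        if not re.fullmatch(r"[0-9a-f]{32}", row["md5"]):
            problems.append(f"line {i}: md5 is not a 32-character hex digest")
        if not any(u.startswith(("s3://", "gs://")) for u in row["urls"].split()):
            problems.append(f"line {i}: no s3:// or gs:// URL")
    return problems

# Hypothetical GUID plus the example file from the tables above.
manifest = (
    "guid\tsize\tmd5\turls\tacl\tauthz\n"
    "example-guid-0001\t123456\t" + "a" * 32
    + "\ts3://example-bucket/examplefile.txt\t[phs000001,c1]\t[phs000001.c1]\n"
)
print(validate_manifest(manifest))  # -> []
```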

* * *
To continue your data submission, return to the main [Gen3 - Data Contribution](https://gen3.org/resources/user/submit-data/#4-submit-additional-project-metadata) page.
116 changes: 116 additions & 0 deletions content/resources/user/gui-submission.md
@@ -0,0 +1,116 @@
---
title: "Gen3 - GUI Submission"
date: 2018-09-12T20:53:20-05:00
linktitle: /resources/user
layout: withtoc
menuname: userMenu
---
{{% markdownwrapper %}}
# GUI Submission of Data Files
* * *

The following guide details the steps a data contributor must take to submit a project to a Gen3 data commons with the Graphical User Interface (GUI).

* * *

## 1. Prepare Project in Submission Portal
* * *
<!--
This section could be removed if we (semi)automate CMC creation for users
-->

In order to upload data files, at least one record in the `core_metadata_collection` node must exist. If your project already has at least one record in this node, you can skip to step 2 below.

Do the following to create your first `core_metadata_collection` record:

1. Go to your data commons' submission portal website
2. Click on 'Submit Data'
3. Find your project in the list of Projects and click 'Submit Data'
4. Click 'Use Form Submission' and choose `core_metadata_collection` from the dropdown list (or [edit and upload this TSV](gen3_core_metadata_collection_template.tsv) by clicking 'Upload File' then 'Submit')

![node_dropdown.png](node_dropdown.png)

![cmc_form.png](cmc_form.png)


5. Fill in the required information (see note below)
6. Click 'Upload submission json from form' and then 'Submit'
7. Make note of the `submitter_id` of your `core_metadata_collection` record for step 3 below

>__Note:__ Minimally, `submitter_id` and `projects.code` are required properties. The project `code` is the name of your project without the "program-" prefix. For example, if your project URL is https://gen3.datacommons.io/example-training, your project's `code` would be 'training', the `program` would be 'example', and your `project_id` would be the combination: 'example-training'.
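
The naming rule in the note can be expressed as a short, illustrative helper (the URL is the hypothetical one from the note; `split_project_id` is not part of any Gen3 tool):

```python
from urllib.parse import urlparse

def split_project_id(project_url):
    """Split the last path segment of a project URL into (program, code);
    the project_id is '<program>-<code>'."""
    project_id = urlparse(project_url).path.strip("/").split("/")[-1]
    program, code = project_id.split("-", 1)
    return program, code

print(split_project_id("https://gen3.datacommons.io/example-training"))
# -> ('example', 'training')
```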

You should have received the message:

```
succeeded: 200
Successfully created entities: 1 of core_metadata_collection
```

If you received any other message, then check the 'Details' to help determine the error.

To view the records in the `core_metadata_collection` node in your project, you can go to:
https://gen3.datacommons.io/example-training/search?node_type=core_metadata_collection
(replacing the first part of that URL with the URL of your actual project).

## 2. Upload Data Files to Object Storage
* * *

Data files such as spreadsheets, sequencing data (BAM, FASTQ), assay results, images, PDFs, etc., are uploaded to object storage with the [gen3-client command-line tool](/resources/user/gen3-client).

>__Note:__ if your data files are already located in cloud storage, such as an AWS or GCS bucket, please see [this page](/resources/user/submit-data/sower) on how to make these files available in a Gen3 data commons.

> **Review comment:** OR if you need to support Google.
>
> I think we need a larger disclaimer that this existing "Upload Data Files" method is very limited in scalability and doesn't work with Google. IMO we should push people to use the other method entirely, because ideally (once there's better tooling and docs) we should remove the old method to clean up our stack.


1. Download the latest [compiled binary](https://github.com/uc-cdis/cdis-data-client/releases/latest) for your operating system.
2. Configure a profile with credentials downloaded from your Profile:

```
./gen3-client configure --profile=<profile_name> --cred=<credentials.json> --apiendpoint=<api_endpoint_url>

```
3. Upload Files: single data file, a directory of files, or matching files:

```
./gen3-client upload --profile=<profile_name> --upload-path=~/files/example.txt

./gen3-client upload --profile=<profile_name> --upload-path=~/files/

./gen3-client upload --profile=<profile_name> --upload-path=~/files/*.txt

```

For detailed instructions on configuring and using the gen3-client, visit the [Gen3 client documentation](/resources/user/gen3-client).

## 3. Map Uploaded Files to a Data File Node
* * *

Once data files are successfully uploaded, the files must be mapped to the appropriate node in the data model before they're accessible to authorized users.

1. Go to your data commons submission portal website.

2. Click 'Submit Data'.

![submit-data.png](submit-data.png)

3. Click the 'Map My Files' button.

![map-my-files.png](map-my-files.png)

4. Select the files to map using the checkboxes and click the 'Map Files' button.

![select-files.png](select-files.png)

5. Select the project and node that the files belong to.

![map-to-node.png](map-to-node.png)

6. Fill in the values of any required properties and click the 'Submit' button.

![fill-required-properties.png](fill-required-properties.png)


> __Note:__ The required property 'Type' in step 6 is the node's name (the 'type' of node) and should be the same as the value selected from the node dropdown list in step 5.

You should receive the message "# files mapped successfully!" upon success.

* * *

To continue your data submission, return to the main [Gen3 - Data Contribution](/resources/user/submit-data/#4-submit-additional-project-metadata) page.
@@ -1 +1 @@
type project_id submitter_id projects.code contributor coverage creator data_type date description format language publisher relation rights source subject title
core_metadata_collection example-training collection-01 training