first draft diirm doc #211

DanBiber · 2023-12-15T01:13:57Z

Jira Ticket: PXP-xxxx

Remove this line if you've changed the title to (PXP-xxxx): <title>

New Features

Breaking Changes

Bug Fixes

Improvements

Dependency updates

Deployment changes

Avantol13 · 2023-12-15T15:16:58Z

content/resources/user/diirm-submission.md

@@ -0,0 +1,118 @@
+---
+title: "Gen3 - DIIRM Submission"


I honestly don't like the name DIIRM (not sure how Bob feels about it either at this point). I think we should just call this Gen3 Data Ingestion.

Avantol13 · 2023-12-15T15:17:23Z

content/resources/user/diirm-submission.md

+# DIIRM Submission of Data Files
+* * *
+
+The following guide details the steps a data contributor must take to submit project data to a Gen3 data commons with the Data, Ingest, Index, Resource Management (DIIRM) system.


The acronym should expand to: Data indexing, ingestion and release management.

Avantol13 · 2023-12-15T15:18:03Z

content/resources/user/diirm-submission.md

+
+* * *
+
+## 1. Prepare Project with the Gen3 sdk tools


Suggested change

-## 1. Prepare Project with the Gen3 sdk tools
+## 1. Prepare Project with the Gen3 SDK tools

Avantol13 · 2023-12-15T15:19:09Z

content/resources/user/diirm-submission.md

+
+## 1. Prepare Project with the Gen3 sdk tools
+* * *
+In order to submit data files, a Gen3 project must be present to associate the files to. The [Gen3 Submission sdk](https://uc-cdis.github.io/gen3sdk-python/_build/html/_modules/gen3/submission.html) has a comprehesive set of tools to enable users to script submission of projects.


This is not true in DIIRM (and it is something important to point out). You don't need the graph to use most of the SDK code for data ingestion.

This is important b/c some projects may only use our Framework Services and not have a full Gen3 Data Commons with a graph.

Avantol13 · 2023-12-15T15:20:17Z

content/resources/user/diirm-submission.md

+
+### Data and Access Considerations
+
+The recommended (and simplest) way for Gen3 to provide controlled access to data is via Signed URLs. Signed URLs are the only fully cloud-agnostic method supported by Gen3 and additionally are supported by all major cloud resource providers. They also allow for short-lived, temporary access to the data for reduced risk.  Lastly, utilizing signed URLs place very few restrictionso nthe organization of data within could bucket(s).


Suggested change

-The recommended (and simplest) way for Gen3 to provide controlled access to data is via Signed URLs. Signed URLs are the only fully cloud-agnostic method supported by Gen3 and additionally are supported by all major cloud resource providers. They also allow for short-lived, temporary access to the data for reduced risk. Lastly, utilizing signed URLs place very few restrictionso nthe organization of data within could bucket(s).
+The recommended (and simplest) way for Gen3 to provide controlled access to data is via Signed URLs. Signed URLs are the only fully cloud-agnostic method supported by Gen3 and additionally are supported by all major cloud resource providers. They also allow for short-lived, temporary access to the data for reduced risk. Lastly, utilizing signed URLs places very few restrictions on the organization of data within cloud bucket(s).

Avantol13 · 2023-12-15T15:30:14Z

content/resources/user/diirm-submission.md

+
+Gen3 offers an [Indexing sdk toolkit](https://uc-cdis.github.io/gen3sdk-python/_build/html/tools/indexing.html#module-gen3.tools.indexing.index_manifest) to build, validate and map all files into a Gen3 datacommons.
+
+This file should offer meta data as well as bucket mapping.


Suggested change

-This file should offer meta data as well as bucket mapping.
+This file should offer metadata as well as bucket mapping.

Avantol13 · 2023-12-15T15:55:25Z

content/resources/user/diirm-submission.md

+
+This file should offer meta data as well as bucket mapping.
+
+| File_name | File_size | md5sum | bucket_urls | acl |


Avantol13 · 2023-12-15T15:57:19Z

content/resources/user/diirm-submission.md

+| examplefile.txt | 123456 | sample_md5 | s3://example-bucket/examplefile.txt gs://example-bucket/examplefile.txt | [phs000001,c1] |
+
+* * *
+To continue your data submission return to the main [Gen3 - Data Contribution](/resources/user/submit-data/#4-submit-additional-project-metadata) page.


This is a little misleading b/c this isn't required by DIIRM. The data submission into the graph is treated as a completely separate process and not a required one. DIIRM handles the pure indexing and metadata ingestion in isolation from the graph submission.
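
For reference while reviewing the manifest table in the hunk above, here is a minimal sketch of how one row with those columns (File_name, File_size, md5sum, bucket_urls, acl) might be produced for a local file. It uses only the Python standard library and does not call the Gen3 SDK. The helper names `manifest_row` and `rows_to_tsv`, the tab delimiter, and the space-separated URL and bracketed ACL formatting are assumptions taken from the example row in the diff, not a confirmed DIIRM format.

```python
import csv
import hashlib
import io
import os

# Column names copied from the table header in the diff under review.
COLUMNS = ["File_name", "File_size", "md5sum", "bucket_urls", "acl"]


def manifest_row(path, bucket_urls, acl):
    """Build one manifest row (hypothetical helper) for a local file."""
    md5 = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in 1 MiB chunks so large data files are not loaded into memory.
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            md5.update(chunk)
    return {
        "File_name": os.path.basename(path),
        "File_size": os.path.getsize(path),
        "md5sum": md5.hexdigest(),
        # Assumption: multiple URLs are space-separated, as in the example row.
        "bucket_urls": " ".join(bucket_urls),
        # Assumption: ACL entries are written as [a,b], as in the example row.
        "acl": "[" + ",".join(acl) + "]",
    }


def rows_to_tsv(rows):
    """Serialize rows to tab-separated text with a header line (hypothetical helper)."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

A file produced this way would still need to be checked against whatever format the indexing tooling actually expects before it is used.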

Avantol13 · 2023-12-15T16:08:59Z

content/resources/user/submit-data.md

@@ -9,121 +9,37 @@ menuname: userMenu
 # Submitting Data Files and Linking Metadata in a Gen3 Data Commons
 * * *

-The following guide details the steps a data contributor must take to submit a project to a Gen3 data commons. Feel free to take a look at our webinars about data submission to our Gen3 data commons on our [YouTube channel](https://www.youtube.com/channel/UCMCwQy4EDd1BaskzZgIOsNQ/videos).
+The following guide details two methods a data contributor can take to submit a project and data to a Gen3 data commons.

 Data in a Gen3 data commons are either stored in variables that are exposed to the API for query (what we refer to as 'metadata') or are stored in files that must be downloaded prior to knowing their content (or 'data files'). For more information on the difference between data files and metadata exposed to the API, see the documentation on the [data dictionary in a Gen3 data commons](/resources/user/dictionary).


The term "metadata" is potentially confusing here, since it could refer to either the graph or the metadata service. We may want to adopt more specific naming.

Avantol13 · 2023-12-15T16:14:39Z

content/resources/user/gui-submission.md

+
+Data files such as spreadsheets, sequencing data (BAM, FASTQ), assay results, images, PDFs, etc., are uploaded to object storage with the [gen3-client command-line tool](/resources/user/gen3-client).
+
+ >__Note:__ if your data files are already located in cloud storage, such as an AWS or GCS bucket, please see [this page](/resources/user/submit-data/sower) on how to make these files available in a Gen3 data commons.


OR if you need to support Google.

I think we need a larger disclaimer that this existing Upload Data Files method is very limited in scalability and doesn't work with Google. IMO we should push people to use the other method entirely, b/c ideally (once there's better tooling and docs) we should remove the old method to clean up our stack.

first draft diirm doc

9ecf879

DanBiber requested review from a team as code owners December 15, 2023 01:13

Typos and link changes

c636b10

Avantol13 reviewed Dec 15, 2023

Update diirm-submission.md

213c3f9
