first draft diirm doc #211

Open
wants to merge 3 commits into
base: master
DanBiber
Contributor

Jira Ticket: PXP-xxxx

  • Remove this line if you've changed the title to (PXP-xxxx): <title>

New Features

Breaking Changes

Bug Fixes

Improvements

Dependency updates

Deployment changes

@DanBiber requested review from a team as code owners on December 15, 2023 01:13
@@ -0,0 +1,118 @@
---
title: "Gen3 - DIIRM Submission"

I honestly don't like the name DIIRM (not sure how Bob feels about it either at this point). I think we should just call this Gen3 Data Ingestion.

# DIIRM Submission of Data Files
* * *

The following guide details the steps a data contributor must take to submit project data to a Gen3 data commons with the Data, Ingest, Index, Resource Management (DIIRM) system.

Data indexing, ingestion and release management


* * *

## 1. Prepare Project with the Gen3 sdk tools

Suggested change
## 1. Prepare Project with the Gen3 sdk tools
## 1. Prepare Project with the Gen3 SDK tools


* * *
In order to submit data files, a Gen3 project must be present to associate the files with. The [Gen3 Submission SDK](https://uc-cdis.github.io/gen3sdk-python/_build/html/_modules/gen3/submission.html) has a comprehensive set of tools to enable users to script submission of projects.

This is not true in DIIRM (and something important to point out). You don't need the graph to use most of the SDK code for data ingestion

This is important b/c some projects may only use our Framework Services and not have a full Gen3 Data Commons with a graph


### Data and Access Considerations

The recommended (and simplest) way for Gen3 to provide controlled access to data is via Signed URLs. Signed URLs are the only fully cloud-agnostic method supported by Gen3 and additionally are supported by all major cloud resource providers. They also allow for short-lived, temporary access to the data for reduced risk. Lastly, utilizing signed URLs place very few restrictionso nthe organization of data within could bucket(s).

Suggested change
The recommended (and simplest) way for Gen3 to provide controlled access to data is via Signed URLs. Signed URLs are the only fully cloud-agnostic method supported by Gen3 and additionally are supported by all major cloud resource providers. They also allow for short-lived, temporary access to the data for reduced risk. Lastly, utilizing signed URLs place very few restrictionso nthe organization of data within could bucket(s).
The recommended (and simplest) way for Gen3 to provide controlled access to data is via Signed URLs. Signed URLs are the only fully cloud-agnostic method supported by Gen3 and additionally are supported by all major cloud resource providers. They also allow for short-lived, temporary access to the data for reduced risk. Lastly, utilizing signed URLs places very few restrictions on the organization of data within cloud bucket(s).
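
To make the "short-lived, temporary access" property concrete, here is a minimal, generic sketch of how an expiring signed URL can work. It is not Gen3's or any cloud provider's actual signing scheme: `SECRET_KEY`, `sign_url`, `is_valid`, and the storage host are hypothetical names chosen for illustration only.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

SECRET_KEY = b"demo-secret"  # hypothetical signing key, for illustration only


def _signature(base_url: str, expires: int) -> str:
    """HMAC-SHA256 over the object URL plus its expiry timestamp."""
    payload = f"{base_url}?expires={expires}".encode()
    return hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()


def sign_url(base_url: str, ttl_seconds: int, now: Optional[float] = None) -> str:
    """Return base_url with an expiry timestamp and a signature appended."""
    issued = time.time() if now is None else now
    expires = int(issued) + ttl_seconds
    query = urlencode({"expires": expires, "signature": _signature(base_url, expires)})
    return f"{base_url}?{query}"


def is_valid(signed_url: str, now: Optional[float] = None) -> bool:
    """Accept the URL only if the signature matches and it has not expired."""
    current = time.time() if now is None else now
    parsed = urlparse(signed_url)
    query = parse_qs(parsed.query)
    try:
        expires = int(query["expires"][0])
        signature = query["signature"][0]
    except (KeyError, ValueError):
        return False
    base_url = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
    expected = _signature(base_url, expires)
    return hmac.compare_digest(signature, expected) and current < expires
```

Because the expiry is part of the signed payload, a client cannot extend its own access by editing the `expires` parameter, and the object path can be anything, which is why this style of access places so few constraints on bucket layout.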


Gen3 offers an [Indexing SDK toolkit](https://uc-cdis.github.io/gen3sdk-python/_build/html/tools/indexing.html#module-gen3.tools.indexing.index_manifest) to build, validate and map all files into a Gen3 data commons.

This file should offer meta data as well as bucket mapping.

Suggested change
This file should offer meta data as well as bucket mapping.
This file should offer metadata as well as bucket mapping.



| File_name | File_size | md5sum | bucket_urls | acl |

guid | size | md5 | urls | acl | authz

| examplefile.txt | 123456 | sample_md5 | s3://example-bucket/examplefile.txt gs://example-bucket/examplefile.txt | [phs000001,c1] |
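
As a rough illustration of what checking such a manifest could look like, the sketch below uses only the Python standard library. It is not the Gen3 SDK's validator and does not replace it. It assumes a tab-separated file with the column names proposed in the review comment above (`guid`, `size`, `md5`, `urls`, `acl`, `authz`), space-separated `s3://` or `gs://` URLs, and a blank `guid` being acceptable. `validate_manifest` and `SAMPLE` are hypothetical names.

```python
import csv
import io
import re

# Column names taken from the review comment; the TSV layout is an assumption.
REQUIRED_COLUMNS = ["guid", "size", "md5", "urls", "acl", "authz"]
MD5_PATTERN = re.compile(r"^[0-9a-f]{32}$")
URL_PATTERN = re.compile(r"^(s3|gs)://[^/\s]+/\S+$")


def validate_manifest(tsv_text):
    """Return a list of problems found; an empty list means every check passed."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return ["missing columns: " + ", ".join(missing)]

    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        if not (row["size"] or "").isdigit():
            problems.append(f"line {line_number}: size is not a non-negative integer")
        if not MD5_PATTERN.match((row["md5"] or "").lower()):
            problems.append(f"line {line_number}: md5 is not 32 hex characters")
        urls = (row["urls"] or "").split()
        if not urls or not all(URL_PATTERN.match(u) for u in urls):
            problems.append(f"line {line_number}: urls must be s3:// or gs:// object URLs")
    return problems


SAMPLE = (
    "guid\tsize\tmd5\turls\tacl\tauthz\n"
    "\t123456\td41d8cd98f00b204e9800998ecf8427e\t"
    "s3://example-bucket/examplefile.txt gs://example-bucket/examplefile.txt\t"
    "[phs000001,c1]\t\n"
)

if __name__ == "__main__":
    for problem in validate_manifest(SAMPLE) or ["manifest looks well-formed"]:
        print(problem)
```

Note that the placeholder `sample_md5` from the example row would fail the md5 check here, so a real manifest needs an actual 32-character checksum.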

* * *
To continue your data submission return to the main [Gen3 - Data Contribution](/resources/user/submit-data/#4-submit-additional-project-metadata) page.

This is a little misleading b/c this isn't required by DIIRM. The data submission into the graph is treated as a completely separate process and not a required one. DIIRM handles the pure indexing and metadata ingestion isolated from the graph submission

@@ -9,121 +9,37 @@ menuname: userMenu
# Submitting Data Files and Linking Metadata in a Gen3 Data Commons
* * *

The following guide details the steps a data contributor must take to submit a project to a Gen3 data commons. Feel free to take a look at our webinars about data submission to our Gen3 data commons on our [YouTube channel](https://www.youtube.com/channel/UCMCwQy4EDd1BaskzZgIOsNQ/videos).
The following guide details two methods a data contributor can take to submit a project and data to a Gen3 data commons.

Data in a Gen3 data commons are either stored in variables that are exposed to the API for query (what we refer to as 'metadata') or are stored in files that must be downloaded prior to knowing their content (or 'data files'). For more information on the difference between data files and metadata exposed to the API, see the documentation on the [data dictionary in a Gen3 data commons](/resources/user/dictionary).

The use of the term metadata is perhaps confusing with regards to the graph vs metadata service. We may want to adopt more specific naming


Data files such as spreadsheets, sequencing data (BAM, FASTQ), assay results, images, PDFs, etc., are uploaded to object storage with the [gen3-client command-line tool](/resources/user/gen3-client).

>__Note:__ if your data files are already located in cloud storage, such as an AWS or GCS bucket, please see [this page](/resources/user/submit-data/sower) on how to make these files available in a Gen3 data commons.

OR if you need to support Google.

I think we need a larger disclaimer that this existing Upload Data Files method is very limited in scalability and doesn't work with Google. imo we should push people to use the other method entirely, b/c ideally (once there's better tooling and docs) we should remove the old method to clean up our stack
