
Cloud access to GTEx data with metadata #20

Open
carlkesselman opened this issue May 23, 2018 · 17 comments

@carlkesselman

KC7 has been working on developing an interoperable representation and instantiation of metadata that can serve as a mechanism for data exchange between various full stacks. This has been based on the DATS metadata model, represented in JSON-LD and serialized using BDBags. We have been developing ETL processes for coding data from data stewards in this format.

A prerequisite for this work is that we have access to the underlying (raw) data on an accessible cloud storage platform, such as AWS or Google. It would be very helpful if we could be provided with the list of Google or S3 endpoints along with expected file lengths and checksums, so we could work to aggregate this data in a form that can be interoperably consumed by FS teams and KC7.

@zflamig

zflamig commented May 23, 2018

Team Calcium would love this for GTEx and TOPMed too! I think for the checksums, we would love crc32, md5, multipart md5, and sha256 btw.
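For reference, a minimal sketch of how those four digests could be computed over a file on disk. The 8 MiB part size used for the multipart md5 is an assumption; it would need to match the part size actually used for the S3 multipart upload.

```python
import hashlib
import zlib

PART_SIZE = 8 * 1024 * 1024  # assumed multipart chunk size; must match the upload's part size

def compute_digests(path):
    md5, sha256 = hashlib.md5(), hashlib.sha256()
    crc32 = 0
    part_md5s = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(PART_SIZE)
            if not chunk:
                break
            md5.update(chunk)
            sha256.update(chunk)
            crc32 = zlib.crc32(chunk, crc32)
            part_md5s.append(hashlib.md5(chunk).digest())
    # S3-style "multipart md5" (ETag for multipart uploads): md5 of the concatenated
    # per-part md5 digests, suffixed with the number of parts.
    multipart_md5 = "%s-%d" % (hashlib.md5(b"".join(part_md5s)).hexdigest(), len(part_md5s))
    return {
        "crc32": format(crc32 & 0xFFFFFFFF, "08x"),
        "md5": md5.hexdigest(),
        "multipart_md5": multipart_md5,
        "sha256": sha256.hexdigest(),
    }
```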

@jnedzel
Contributor

jnedzel commented May 23, 2018 via email

@cricketsloan

How do we get that manifest?

@jnedzel
Contributor

jnedzel commented May 23, 2018

I've committed three manifest files into this repository.

@mikedarcy

The public data manifest recently added here contains a file_size field with values like "7.9 MiB", "15.12 KiB", and "1.82 GiB". This is inconsistent with how the file_length field is represented (in bytes) in the private data manifests (e.g., here).

It is important to provide an integer representation of the exact file size in bytes, as some tools might need this data to function properly (or optimally). Having the string-based, human-readable file_size is useful, but not a sufficient replacement for the true byte count of the data. The former can always be calculated from the latter by the consumer of the manifest.
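To illustrate that last point, here is a small sketch of deriving a display string from the exact byte count (the rounding and formatting of the existing file_size strings is an assumption on my part; the reverse conversion is lossy):

```python
def human_readable(num_bytes):
    """Derive a display string like '818.10 MiB' from an exact byte count."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    size = float(num_bytes)
    for unit in units:
        if size < 1024 or unit == units[-1]:
            return f"{int(size)} B" if unit == "B" else f"{size:.2f} {unit}"
        size /= 1024

print(human_readable(33902))      # '33.11 KiB'
print(human_readable(857841780))  # '818.10 MiB'
```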

@jnedzel
Contributor

jnedzel commented May 31, 2018

Hi Mike, this is now fixed.

@mikedarcy

There is an issue with the public data manifest where some rows duplicate file_name yet the URL fields refer to different locations in cloud storage. Sometimes the same actual file content is being referenced (based on the md5 provided), and sometimes an altogether different file is referenced.

For example, these two rows refer to the same content, only differing by the URL in storage:

GTEx_Data_V6_Annotations_SampleAttributesDD.xlsx	gs://gtex_analysis_v6p/annotations/GTEx_Data_V6_Annotations_SampleAttributesDD.xlsx	33902	4deba6c7c24b5cb5ed01df32518cda2d	https://storage.googleapis.com/gtex_analysis_v6p/annotations/GTEx_Data_V6_Annotations_SampleAttributesDD.xlsx
GTEx_Data_V6_Annotations_SampleAttributesDD.xlsx	gs://gtex_analysis_v6/annotations/GTEx_Data_V6_Annotations_SampleAttributesDD.xlsx	33902	4deba6c7c24b5cb5ed01df32518cda2d	https://storage.googleapis.com/gtex_analysis_v6/annotations/GTEx_Data_V6_Annotations_SampleAttributesDD.xlsx

Whereas these two rows refer to different content, but with the same filename:

description.txt	gs://gtex_analysis_v6/reference/description.txt	816	d8dc9a2e3ec3e0f5c6f0c747463009f4	https://storage.googleapis.com/gtex_analysis_v6/reference/description.txt
description.txt	gs://gtex_analysis_v6/annotations/description.txt	595	fd6e6d2fedb460d6a99b94c87718dd05	https://storage.googleapis.com/gtex_analysis_v6/annotations/description.txt

It is difficult to determine how to handle these kinds of records programmatically. If the files are in fact the same content, then we don't really need the file uploaded to multiple paths in cloud storage, which results in duplicate records in the manifest and additional storage overhead. Otherwise, if the files referenced are in fact different (by virtue of md5 and cloud path), then we need a unique filename (or a relative path, e.g. gtex_analysis_v6/annotations/description.txt) to correlate with the referenced location.

Here's a list of all of the records with duplicated file_name fields that I encountered while processing the manifest (note that only the entries after the first occurrence of each name are listed; the first occurrence itself is not included):

{"filename": "GTEx_Data_V6_Annotations_SampleAttributesDD.xlsx", "url": "https://storage.googleapis.com/gtex_analysis_v6/annotations/GTEx_Data_V6_Annotations_SampleAttributesDD.xlsx", "length": "33902", "md5": "4deba6c7c24b5cb5ed01df32518cda2d"}
{"filename": "GTEx_Data_V6_Annotations_SampleAttributesDS.txt", "url": "https://storage.googleapis.com/gtex_analysis_v6/annotations/GTEx_Data_V6_Annotations_SampleAttributesDS.txt", "length": "6285091", "md5": "6273af715b43ef89c7f9f9af8524031b"}
{"filename": "GTEx_Data_V6_Annotations_SubjectPhenotypesDS.txt", "url": "https://storage.googleapis.com/gtex_analysis_v6/annotations/GTEx_Data_V6_Annotations_SubjectPhenotypesDS.txt", "length": "11666", "md5": "5e31c42421f0ff27a4c83872027012d5"}
{"filename": "GTEx_Data_V6_Annotations_SubjectPhenotypes_DD.xlsx", "url": "https://storage.googleapis.com/gtex_analysis_v6/annotations/GTEx_Data_V6_Annotations_SubjectPhenotypes_DD.xlsx", "length": "22212", "md5": "ad5b5e461037c7ab14d920941ab0b821"}
{"filename": "description.txt", "url": "https://storage.googleapis.com/gtex_analysis_v6/annotations/description.txt", "length": "595", "md5": "fd6e6d2fedb460d6a99b94c87718dd05"}
{"filename": "Homo_sapiens_assembly19.fasta.gz", "url": "https://storage.googleapis.com/gtex_analysis_v6/reference/Homo_sapiens_assembly19.fasta.gz", "length": "857841780", "md5": "d8c8e4e848f9a16dd25f741720e668ad"}
{"filename": "description.txt", "url": "https://storage.googleapis.com/gtex_analysis_v6/reference/description.txt", "length": "816", "md5": "d8dc9a2e3ec3e0f5c6f0c747463009f4"}
{"filename": "description.txt", "url": "https://storage.googleapis.com/gtex_analysis_v6/rna_seq_data/description.txt", "length": "655", "md5": "6b693c6b74d84c506576f3abc1c0e367"}
{"filename": "description.txt", "url": "https://storage.googleapis.com/gtex_analysis_v6/single_tissue_eqtl_data/description.txt", "length": "677", "md5": "b73ab7510ad8d7eaa046606b11bb13cd"}
{"filename": "README.eqtls", "url": "https://storage.googleapis.com/gtex_analysis_v4/single_tissue_eqtl_data/README.eqtls", "length": "860", "md5": "bd4becc696b20737dfeda17dfdd61821"}
{"filename": "GTEx_genot_imputed_variants_info4_maf05_CR95_CHR_POSb37_ID_REF_ALT.txt.zip", "url": "https://storage.googleapis.com/gtex_analysis_pilot_v3/reference/GTEx_genot_imputed_variants_info4_maf05_CR95_CHR_POSb37_ID_REF_ALT.txt.zip", "length": "59886339", "md5": "5cf035742a19634b71605f93274c86c9"}
{"filename": "README.eqtls", "url": "https://storage.googleapis.com/gtex_analysis_pilot_v3/single_tissue_eqtl_data/README.eqtls", "length": "1294", "md5": "d4897fc04591c7aa5c8c5d7c7f2c3013"}

@jnedzel
Contributor

jnedzel commented Jun 1, 2018

Hi Mike:

These are separate logical files. You can see them here: https://gtexportal.org/home/datasets

We release a separate, complete set of files with each new release. Between releases, sometimes those files change, sometimes they don't. But they are separate physical assets in different locations in our GCS buckets.

@mikedarcy

Duplicate file content for logical files aside, if I am trying to prepare a bdbag for this dataset, I have no way to disambiguate the following unique files unless I "guess" at your intended file system organization by inspecting the logical path in the URL:

{"filename": "description.txt", "url": "https://storage.googleapis.com/gtex_analysis_v6/reference/description.txt", "length": "816", "md5": "d8dc9a2e3ec3e0f5c6f0c747463009f4"}
{"filename": "description.txt", "url": "https://storage.googleapis.com/gtex_analysis_v6/rna_seq_data/description.txt", "length": "655", "md5": "6b693c6b74d84c506576f3abc1c0e367"}
{"filename": "description.txt", "url": "https://storage.googleapis.com/gtex_analysis_v6/single_tissue_eqtl_data/description.txt", "length": "677", "md5": "b73ab7510ad8d7eaa046606b11bb13cd"}

In a situation where I am trying to materialize all of the resources back into a local filesystem, I cannot resolve where all of these files with the same name are supposed to be placed without guessing your intentions based on the URL path.

If you just included the relative path (which is already part of the URL field) in your file_name field, my issue would be solved: you would be making an authoritative statement about how the downloaded data should be organized.

It would also be great to have another field like dataset where you store the dataset name, e.g., "gtex_analysis_v6", which would give me the explicit and authoritative information I need to understand how the files should logically be grouped together.

For example, something like:

file_name     dataset	object_location	file_size	md5_hash	public_url
annotations/description.txt	gtex_analysis_v6	gs://gtex_analysis_v6/annotations/description.txt	595	fd6e6d2fedb460d6a99b94c87718dd05	https://storage.googleapis.com/gtex_analysis_v6/annotations/description.txt
reference/description.txt	gtex_analysis_v6	gs://gtex_analysis_v6/reference/description.txt	816	d8dc9a2e3ec3e0f5c6f0c747463009f4	https://storage.googleapis.com/gtex_analysis_v6/reference/description.txt

The above is unambiguous and authoritative, and provides the consumer with everything needed to logically restructure the files on the filesystem without requiring a priori knowledge of your cloud bucket storage hierarchy. It also has the benefit of allowing you to change those storage paths without affecting how the downstream consumer organizes the data.
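For completeness, the "guess" I describe above amounts to something like the following sketch, which splits each public_url into a dataset (bucket) component and a relative path; this is exactly the a priori knowledge of your bucket layout that an explicit field would make unnecessary:

```python
from urllib.parse import urlparse

def split_url(public_url):
    """Guess a (dataset, relative_path) pair from a storage.googleapis.com URL."""
    # e.g. https://storage.googleapis.com/gtex_analysis_v6/annotations/description.txt
    path = urlparse(public_url).path.lstrip("/")      # gtex_analysis_v6/annotations/description.txt
    dataset, _, relative_path = path.partition("/")
    return dataset, relative_path

print(split_url("https://storage.googleapis.com/gtex_analysis_v6/annotations/description.txt"))
# ('gtex_analysis_v6', 'annotations/description.txt')
```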

@jnedzel
Contributor

jnedzel commented Jun 1, 2018

We can add a release column next week.

@mikedarcy

Awesome! What about the relative path information? Including it as part of the file_name or adding another field like path would pretty much solve all of my issues...

@jnedzel
Contributor

jnedzel commented Jun 1, 2018

Sure, we can do that.

@jnedzel
Contributor

jnedzel commented Jun 8, 2018

@mikedarcy I haven't forgotten about this. I will get to it next week.

@mikedarcy

No problem. I am going to go ahead and make some bag versions from the manifest as-is. I will use the paths from the URLs to map back to the local file system structure. If things change I can easily regenerate the bags from a new version of the manifest.

@mikedarcy

Team Argon has created bdbags for each release of the public GTEx data listed in this manifest: https://github.com/dcppc/data-stewards/blob/master/gtex/v7/manifests/public_data/gtex_manifest_file.txt

In addition to these bags, we've created a bag for the V6 release that includes file references from both the V6 and V6p (patch) releases, and an "uber-bag" that includes references to the files in all releases (basically a bag of the entire manifest).

We have assigned minid identifiers to each bag, and the bag content itself can be downloaded by visiting the landing page for the corresponding identifier and downloading the zip file of the bag. You can use the bdbag Python program (https://github.com/fair-research/bdbag) to automatically download a bag's constituent files and verify the content checksums. We have independently validated all of the bags posted here by downloading the content and running the bag validation process.

"GTEx Analysis Pilot V3 in zipped bag format": http://identifiers.org/minid:b9vm4j
"GTEx Analysis V4 in zipped bag format": http://identifiers.org/minid:b9qt2m
"GTEx Analysis V6 in zipped bag format" http://identifiers.org/minid:b9m401
"GTEx Analysis V6p in zipped bag format": http://identifiers.org/minid:b9g98j
"GTEx Analysis V6 (including V6p patch) in zipped bag format": http://identifiers.org/minid:b9bm4w
"GTEx Analysis V7 in zipped bag format": http://identifiers.org/minid:b96t2z
"GTEx Analysis (all releases) in zipped bag format": http://identifiers.org/minid:b9341r

@mikedarcy

In the protected data file manifests:

There are md5 checksums and file sizes for the cram file in the cram_file_md5 and cram_file_size fields, but the equivalent fields are missing for the cram_index file. Is this an oversight, or is there some other reason this data is not present?

@francois-a
Contributor

This was an oversight. We'll provide updated files shortly.
