
Cloud access to GTEx data with metadata #20

Open
carlkesselman opened this issue May 23, 2018 · 17 comments

@carlkesselman

KC7 has been working on developing an interoperable representation and instantiation of metadata that can serve as a mechanism for data exchange between various full stacks. This has been based on the DATS metadata model, represented in JSON-LD and serialized using BDBags. We have been developing ETL processes for coding data from data stewards in this format.

A prerequisite for this work is that we have access to the underlying (raw) data on an accessible cloud storage platform, such as AWS or Google. It would be very helpful if we could be provided with the list of Google or S3 endpoints along with expected file lengths and checksums, so we could work to aggregate this data in a form that can be interoperably consumed by FS teams and KC7.

@zflamig

zflamig commented May 23, 2018

Team Calcium would love this for GTEx and TOPMed too! I think for the checksums, we would love crc32, md5, multipart md5, and sha256 btw.
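For reference, a minimal sketch of how those four digests could be computed over a file on disk. The 8 MiB part size used for the multipart md5 is an assumption; it would need to match the part size actually used for the S3 multipart upload.

```python
import hashlib
import zlib

PART_SIZE = 8 * 1024 * 1024  # assumed multipart chunk size; must match the upload's part size

def compute_digests(path):
    md5, sha256 = hashlib.md5(), hashlib.sha256()
    crc32 = 0
    part_md5s = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(PART_SIZE)
            if not chunk:
                break
            md5.update(chunk)
            sha256.update(chunk)
            crc32 = zlib.crc32(chunk, crc32)
            part_md5s.append(hashlib.md5(chunk).digest())
    # S3-style "multipart md5" (ETag for multipart uploads): md5 of the concatenated
    # per-part md5 digests, suffixed with the number of parts.
    multipart_md5 = "%s-%d" % (hashlib.md5(b"".join(part_md5s)).hexdigest(), len(part_md5s))
    return {
        "crc32": format(crc32 & 0xFFFFFFFF, "08x"),
        "md5": md5.hexdigest(),
        "multipart_md5": multipart_md5,
        "sha256": sha256.hexdigest(),
    }
```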

@jnedzel
Contributor

jnedzel commented May 23, 2018 via email

@cricketsloan

How do we get that manifest?

@jnedzel
Contributor

jnedzel commented May 23, 2018

I've committed three manifest files into this repository.

@mikedarcy

The public data manifest recently added here contains a file_size field with values like "7.9 MiB", "15.12 KiB", and "1.82 GiB". This is inconsistent with how the file_length field is represented (in bytes) in the private data manifests (e.g., here).

It is important to provide an integer representation of the exact file size in bytes, as some tools might need this data to function properly (or optimally). Having the string-based, human-readable file_size is useful, but not a sufficient replacement for the true byte count of the data. The former can always be calculated from the latter by the consumer of the manifest.
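To illustrate that last point, here is a small sketch of deriving a display string from the exact byte count (the rounding and formatting of the existing file_size strings is an assumption on my part; the reverse conversion is lossy):

```python
def human_readable(num_bytes):
    """Derive a display string like '818.10 MiB' from an exact byte count."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    size = float(num_bytes)
    for unit in units:
        if size < 1024 or unit == units[-1]:
            return f"{int(size)} B" if unit == "B" else f"{size:.2f} {unit}"
        size /= 1024

print(human_readable(33902))      # '33.11 KiB'
print(human_readable(857841780))  # '818.10 MiB'
```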

@jnedzel
Contributor

jnedzel commented May 31, 2018

Hi Mike, this is now fixed.

@mikedarcy

There is an issue with the public data manifest where some rows duplicate file_name yet the URL fields refer to different locations in cloud storage. Sometimes the same actual file content is being referenced (based on the md5 provided), and sometimes an altogether different file is referenced.

For example, these two rows refer to the same content, only differing by the URL in storage:

GTEx_Data_V6_Annotations_SampleAttributesDD.xlsx	gs://gtex_analysis_v6p/annotations/GTEx_Data_V6_Annotations_SampleAttributesDD.xlsx	33902	4deba6c7c24b5cb5ed01df32518cda2d	https://storage.googleapis.com/gtex_analysis_v6p/annotations/GTEx_Data_V6_Annotations_SampleAttributesDD.xlsx
GTEx_Data_V6_Annotations_SampleAttributesDD.xlsx	gs://gtex_analysis_v6/annotations/GTEx_Data_V6_Annotations_SampleAttributesDD.xlsx	33902	4deba6c7c24b5cb5ed01df32518cda2d	https://storage.googleapis.com/gtex_analysis_v6/annotations/GTEx_Data_V6_Annotations_SampleAttributesDD.xlsx

Whereas these two rows refer to different content, but with the same filename:

description.txt	gs://gtex_analysis_v6/reference/description.txt	816	d8dc9a2e3ec3e0f5c6f0c747463009f4	https://storage.googleapis.com/gtex_analysis_v6/reference/description.txt
description.txt	gs://gtex_analysis_v6/annotations/description.txt	595	fd6e6d2fedb460d6a99b94c87718dd05	https://storage.googleapis.com/gtex_analysis_v6/annotations/description.txt

It is difficult to determine how to handle these kinds of records programmatically. If the files are in fact the same content, then we don't really need the file uploaded to multiple paths in cloud storage, which results in duplicate records in the manifest and additional storage overhead. Otherwise, if the files referenced are in fact different (by virtue of md5 and cloud path), then we need a unique filename (or a relative path, e.g. gtex_analysis_v6/annotations/description.txt) to correlate with the referenced location.

Here's a list of all of the records with duplicated file_name fields that I encountered while processing the manifest (note that only the entries after the first occurrence of each name are listed; the first occurrence itself is not included):

{"filename": "GTEx_Data_V6_Annotations_SampleAttributesDD.xlsx", "url": "https://storage.googleapis.com/gtex_analysis_v6/annotations/GTEx_Data_V6_Annotations_SampleAttributesDD.xlsx", "length": "33902", "md5": "4deba6c7c24b5cb5ed01df32518cda2d"}
{"filename": "GTEx_Data_V6_Annotations_SampleAttributesDS.txt", "url": "https://storage.googleapis.com/gtex_analysis_v6/annotations/GTEx_Data_V6_Annotations_SampleAttributesDS.txt", "length": "6285091", "md5": "6273af715b43ef89c7f9f9af8524031b"}
{"filename": "GTEx_Data_V6_Annotations_SubjectPhenotypesDS.txt", "url": "https://storage.googleapis.com/gtex_analysis_v6/annotations/GTEx_Data_V6_Annotations_SubjectPhenotypesDS.txt", "length": "11666", "md5": "5e31c42421f0ff27a4c83872027012d5"}
{"filename": "GTEx_Data_V6_Annotations_SubjectPhenotypes_DD.xlsx", "url": "https://storage.googleapis.com/gtex_analysis_v6/annotations/GTEx_Data_V6_Annotations_SubjectPhenotypes_DD.xlsx", "length": "22212", "md5": "ad5b5e461037c7ab14d920941ab0b821"}
{"filename": "description.txt", "url": "https://storage.googleapis.com/gtex_analysis_v6/annotations/description.txt", "length": "595", "md5": "fd6e6d2fedb460d6a99b94c87718dd05"}
{"filename": "Homo_sapiens_assembly19.fasta.gz", "url": "https://storage.googleapis.com/gtex_analysis_v6/reference/Homo_sapiens_assembly19.fasta.gz", "length": "857841780", "md5": "d8c8e4e848f9a16dd25f741720e668ad"}
{"filename": "description.txt", "url": "https://storage.googleapis.com/gtex_analysis_v6/reference/description.txt", "length": "816", "md5": "d8dc9a2e3ec3e0f5c6f0c747463009f4"}
{"filename": "description.txt", "url": "https://storage.googleapis.com/gtex_analysis_v6/rna_seq_data/description.txt", "length": "655", "md5": "6b693c6b74d84c506576f3abc1c0e367"}
{"filename": "description.txt", "url": "https://storage.googleapis.com/gtex_analysis_v6/single_tissue_eqtl_data/description.txt", "length": "677", "md5": "b73ab7510ad8d7eaa046606b11bb13cd"}
{"filename": "README.eqtls", "url": "https://storage.googleapis.com/gtex_analysis_v4/single_tissue_eqtl_data/README.eqtls", "length": "860", "md5": "bd4becc696b20737dfeda17dfdd61821"}
{"filename": "GTEx_genot_imputed_variants_info4_maf05_CR95_CHR_POSb37_ID_REF_ALT.txt.zip", "url": "https://storage.googleapis.com/gtex_analysis_pilot_v3/reference/GTEx_genot_imputed_variants_info4_maf05_CR95_CHR_POSb37_ID_REF_ALT.txt.zip", "length": "59886339", "md5": "5cf035742a19634b71605f93274c86c9"}
{"filename": "README.eqtls", "url": "https://storage.googleapis.com/gtex_analysis_pilot_v3/single_tissue_eqtl_data/README.eqtls", "length": "1294", "md5": "d4897fc04591c7aa5c8c5d7c7f2c3013"}

@jnedzel
Contributor

jnedzel commented Jun 1, 2018

Hi Mike:

These are separate logical files. You can see them here: https://gtexportal.org/home/datasets

We release a separate, complete set of files with each new release. Between releases, sometimes those files change, sometimes they don't. But they are separate physical assets in different locations in our GCS buckets.

@mikedarcy

Duplicate file content for logical files aside, if I am trying to prepare a bdbag for this dataset, I have no way to disambiguate the following unique files unless I "guess" at your intended file system organization by inspecting the logical path in the URL:

{"filename": "description.txt", "url": "https://storage.googleapis.com/gtex_analysis_v6/reference/description.txt", "length": "816", "md5": "d8dc9a2e3ec3e0f5c6f0c747463009f4"}
{"filename": "description.txt", "url": "https://storage.googleapis.com/gtex_analysis_v6/rna_seq_data/description.txt", "length": "655", "md5": "6b693c6b74d84c506576f3abc1c0e367"}
{"filename": "description.txt", "url": "https://storage.googleapis.com/gtex_analysis_v6/single_tissue_eqtl_data/description.txt", "length": "677", "md5": "b73ab7510ad8d7eaa046606b11bb13cd"}

In a situation where I am trying to materialize all of the resources back into a local filesystem, I cannot resolve where all of these files with the same name are supposed to be placed without guessing your intentions based on the URL path.

If you just included the relative path (which is already part of the URL field) in your file_name field, my issue would be solved: you would be making an authoritative statement about how the downloaded data should be organized.

It would also be great to have another field like dataset where you store the dataset name, e.g., "gtex_analysis_v6", which would give me the explicit and authoritative information I need to understand how the files should logically be grouped together.

For example, something like:

file_name     dataset	object_location	file_size	md5_hash	public_url
annotations/description.txt	gtex_analysis_v6	gs://gtex_analysis_v6/annotations/description.txt	595	fd6e6d2fedb460d6a99b94c87718dd05	https://storage.googleapis.com/gtex_analysis_v6/annotations/description.txt
reference/description.txt	gtex_analysis_v6	gs://gtex_analysis_v6/reference/description.txt	816	d8dc9a2e3ec3e0f5c6f0c747463009f4	https://storage.googleapis.com/gtex_analysis_v6/reference/description.txt

The above is unambiguous and authoritative, and provides the consumer with everything needed to logically restructure the files on the filesystem without requiring a priori knowledge of your cloud bucket storage hierarchy. It also has the benefit of allowing you to change those storage paths without affecting how the downstream consumer organizes the data.
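For completeness, the "guess" I describe above amounts to something like the following sketch, which splits each public_url into a dataset (bucket) component and a relative path; this is exactly the a priori knowledge of your bucket layout that an explicit field would make unnecessary:

```python
from urllib.parse import urlparse

def split_url(public_url):
    """Guess a (dataset, relative_path) pair from a storage.googleapis.com URL."""
    # e.g. https://storage.googleapis.com/gtex_analysis_v6/annotations/description.txt
    path = urlparse(public_url).path.lstrip("/")      # gtex_analysis_v6/annotations/description.txt
    dataset, _, relative_path = path.partition("/")
    return dataset, relative_path

print(split_url("https://storage.googleapis.com/gtex_analysis_v6/annotations/description.txt"))
# ('gtex_analysis_v6', 'annotations/description.txt')
```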

@jnedzel
Contributor

jnedzel commented Jun 1, 2018

We can add a release column next week.

@mikedarcy

Awesome! What about the relative path information? Including it as part of the file_name or adding another field like path would pretty much solve all of my issues...

@jnedzel
Contributor

jnedzel commented Jun 1, 2018

Sure, we can do that.

@jnedzel
Contributor

jnedzel commented Jun 8, 2018

@mikedarcy I haven't forgotten about this. I will get to it next week.

@mikedarcy

No problem. I am going to go ahead and make some bag versions from the manifest as-is. I will use the paths from the URLs to map back to the local file system structure. If things change I can easily regenerate the bags from a new version of the manifest.

@mikedarcy

Team Argon has created bdbags for each release of the public GTEx data listed in this manifest: https://github.com/dcppc/data-stewards/blob/master/gtex/v7/manifests/public_data/gtex_manifest_file.txt

In addition to these bags, we've created a bag for the V6 release that includes file references from both the V6 and V6p (patch) releases, and an "uber-bag" that includes references to the files in all releases (basically a bag of the entire manifest).

We have assigned minid identifiers to each bag, and the bag content itself can be downloaded by visiting the landing page for the corresponding identifier and downloading the zip file of the bag. You can use the bdbag Python program (https://github.com/fair-research/bdbag) to automatically download a bag's constituent files and verify the content checksums. We have independently validated all of the bags posted here by downloading the content and running the bag validation process.

"GTEx Analysis Pilot V3 in zipped bag format": http://identifiers.org/minid:b9vm4j
"GTEx Analysis V4 in zipped bag format": http://identifiers.org/minid:b9qt2m
"GTEx Analysis V6 in zipped bag format" http://identifiers.org/minid:b9m401
"GTEx Analysis V6p in zipped bag format": http://identifiers.org/minid:b9g98j
"GTEx Analysis V6 (including V6p patch) in zipped bag format": http://identifiers.org/minid:b9bm4w
"GTEx Analysis V7 in zipped bag format": http://identifiers.org/minid:b96t2z
"GTEx Analysis (all releases) in zipped bag format": http://identifiers.org/minid:b9341r

@mikedarcy

In the protected data file manifests:

There are md5 checksums and file sizes for the cram file in the cram_file_md5 and cram_file_size fields, but the equivalent fields are missing for the cram_index file. Is this an oversight, or is there some other reason this data is not present?

@francois-a
Contributor

This was an oversight. We'll provide updated files shortly.
