Cloud access to GTEx data with metadata #20
Team Calcium would love this for GTEx and TOPMed too! I think for the checksums, we would love crc32, md5, multipart md5, and sha256 btw.
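For concreteness, a minimal one-pass sketch of computing those four checksums. Treating "multipart md5" as the S3 ETag convention (md5 of the concatenated per-part md5 digests, suffixed with the part count) and the 8 MiB part size are assumptions, since the thread doesn't define them:

```python
# Sketch: compute crc32, md5, sha256, and an S3-style multipart md5 in one pass.
import hashlib
import zlib

def file_checksums(path, part_size=8 * 1024 * 1024):
    crc = 0
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    part_digests = []
    with open(path, "rb") as f:
        while True:
            part = f.read(part_size)
            if not part:
                break
            crc = zlib.crc32(part, crc)
            md5.update(part)
            sha256.update(part)
            part_digests.append(hashlib.md5(part).digest())
    multipart = hashlib.md5(b"".join(part_digests)).hexdigest()
    return {
        "crc32": format(crc & 0xFFFFFFFF, "08x"),
        "md5": md5.hexdigest(),
        "sha256": sha256.hexdigest(),
        "multipart_md5": f"{multipart}-{len(part_digests)}",
    }
```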
We have already created a manifest file of the GTEx raw data, with URLs and MD5 checksums.
How do we get that manifest?
I've committed three manifest files into this repository.
The public data manifest recently added here contains a human-readable file size, but not the exact size in bytes. It is important to provide an integer representation of the exact file size in bytes, as some tools might need this data to function properly (or optimally). Having only the string-based, human-readable size is not sufficient for those cases.
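As a small illustration of the point, the kind of check a download tool might perform with an exact byte count (the function and field name here are hypothetical):

```python
# Sketch: verify a downloaded file against a manifest's integer byte count.
# A human-readable string like "1.2 GB" cannot support this comparison.
import os

def verify_size(local_path, expected_bytes):
    return os.path.getsize(local_path) == int(expected_bytes)
```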
Hi Mike, this is now fixed.
There is an issue with the public data manifest where some rows duplicate the filename. For example, these two rows refer to the same content, differing only by the storage URL:
Whereas these two rows refer to different content, but with the same filename:
It is difficult to determine how to handle these kinds of records programmatically. If the files are in fact the same content, then we don't really need the file uploaded to multiple different paths in cloud storage, resulting in duplicate records in the manifest and additional storage overhead. Otherwise, if the files referenced are in fact different (by virtue of md5 and cloud path), then we need a unique filename (or a relative path) to disambiguate them. Here's a list of all of the records containing duplicated filenames:
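As a sketch, both kinds of duplicates can be flagged like this (tab-delimited input and the file_name/md5 column names are assumptions about the manifest layout):

```python
# Sketch: group manifest rows by filename and classify the duplicates.
import csv
from collections import defaultdict

def find_duplicates(manifest_path):
    by_name = defaultdict(list)
    with open(manifest_path, newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            by_name[row["file_name"]].append(row)
    for name, rows in sorted(by_name.items()):
        if len(rows) < 2:
            continue
        if len({r["md5"] for r in rows}) == 1:
            print(f"{name}: same content at {len(rows)} URLs")
        else:
            print(f"{name}: name collision, {len(rows)} distinct contents")
```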
Hi Mike: These are separate logical files. You can see them here: https://gtexportal.org/home/datasets. We release a separate, complete set of files with each new release. Between releases, sometimes those files change, sometimes they don't. But they are separate physical assets in different locations in our GCS buckets.
Duplicate file content for logical files aside, if I am trying to prepare a bag of these files from the manifest, the duplicated filenames remain a problem.
In a situation where I am trying to materialize all of the resources back into a local filesystem, I cannot resolve where all of these files with the same name are supposed to be placed without guessing your intentions based on the URL path. If you just included the relative path that is already part of the URL field as part of your filename field, that would disambiguate the records. It would also be great to have another field like release, so a consumer knows which data release each file belongs to. For example, something like:
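A hypothetical illustration (these field names and values are invented for the example, not taken from the actual manifest):

```
file_name                        md5                               release
v6p/rna_seq/gene_reads.gct.gz    11111111111111111111111111111111  V6p
v7/rna_seq/gene_reads.gct.gz     22222222222222222222222222222222  V7
```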
The above is unambiguous and authoritative, and provides the consumer everything needed to logically restructure the files on the filesystem without requiring a priori knowledge of your cloud bucket storage hierarchy. It also has the benefit of allowing you to change those storage paths without affecting how the downstream consumer organizes the data.
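Under that scheme, the consumer-side layout logic becomes trivial; a sketch, assuming the proposed release and file_name (relative path) columns plus an existing url column:

```python
# Sketch: place each manifest entry on the local filesystem using only
# manifest fields, with no knowledge of the bucket hierarchy.
import csv
import os

def plan_layout(manifest_path, dest_root):
    with open(manifest_path, newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            # e.g. <dest_root>/V7/rna_seq/gene_reads.gct.gz
            yield row["url"], os.path.join(dest_root, row["release"], row["file_name"])
```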
We can add a release column next week.
Awesome! What about the relative path information? Including it as part of the filename field would work.
Sure, we can do that.
@mikedarcy I haven't forgotten about this. I will get to it next week.
No problem. I am going to go ahead and make some bag versions from the manifest as-is. I will use the paths from the URLs to map back to the local file system structure. If things change I can easily regenerate the bags from a new version of the manifest.
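A sketch of that URL-to-path mapping, assuming https://<host>/<bucket>/<relative/path> style URLs so that the first path component (the bucket) is dropped:

```python
# Sketch: derive a local relative path from a manifest URL by stripping
# the host and the leading bucket component.
from urllib.parse import urlparse

def local_relpath(url):
    parts = urlparse(url).path.lstrip("/").split("/")
    return "/".join(parts[1:]) if len(parts) > 1 else parts[0]
```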
Team Argon has created BDBags for each of the GTEx releases listed in the manifest. In addition to these bags, we've created a bag for the V6 release that includes file references included in both the V6 and V6p (patch) releases, and an "uber-bag" that includes references to files in all releases (basically a bag of the entire manifest). We have assigned minids to these bags, for example "GTEx Analysis Pilot V3 in zipped bag format": http://identifiers.org/minid:b9vm4j
In the protected data file manifests:
There are md5 checksums and file sizes for the cram files in the fields
This was an oversight. We'll provide updated files shortly.
KC7 has been working on developing an interoperable representation and instantiation of metadata that can serve as a mechanism for data exchange between the various full stacks. This is based on the DATS metadata model, represented in JSON-LD and serialized using BDBags. We have been developing ETL processes for encoding data from data stewards in this format.
A prerequisite for this work is that we have access to the underlying (raw) data on an accessible cloud storage platform, such as AWS or Google. It would be very helpful if we could be provided the list of Google or S3 endpoints along with expected file lengths and checksums, so we could work to aggregate this data in a form that can be interoperably consumed by the FS teams and KC7.
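For concreteness, one record of the kind being requested might look like this (the endpoint, field names, and values are purely illustrative):

```
url                                        file_size_bytes  md5
gs://example-gtex-bucket/v7/some_file.gz   123456789        0123456789abcdef0123456789abcdef
```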