GTEx XML submission for SRA/SDDP #18

Open
webermn opened this issue May 2, 2018 · 10 comments

@webermn

webermn commented May 2, 2018

To facilitate DCPPC access to GTEx data, can the Broad Data Steward team please submit XML to the NCBI Sequence Read Archive (SRA) Sequence Data Delivery Pilot (SDDP) for all GTEx data that is being shared with the Data Commons Consortium?

The submission should describe the data and its location on both the Google and Amazon clouds using the XML schema as described in the attached PDF and in these examples: ftp://ftp.ncbi.nih.gov/sra/examples/cloud_examples/

Adam Stine from NCBI ([email protected]) is the point of contact for questions regarding this submission and can assist with linking this submission to existing GTEx records. I'm also happy to set up a meeting with relevant folks from Broad and elsewhere to discuss in more detail.

Would a reasonable target for task completion be, say, sometime next week?

cc/ @francois-a @jnedzel @saulakravitz
SRA-XMLCloudFormatGuide-250418-1832-37.pdf

@owhite
Contributor

owhite commented May 2, 2018

Nick, out of curiosity, could you outline what role this plays in connecting DCPPC data and GTEx? I'm just curious about the mechanics of what's happening.

@clarisca

clarisca commented May 3, 2018

@webermn: where can we find more information about the "NCBI Sequence Read Archive (SRA) Sequence Data Delivery Pilot (SDDP)"? Is this a pilot developed for the DCPPC? @krobasky

@krobasky

krobasky commented May 3, 2018 via email

@saulakravitz

saulakravitz commented May 3, 2018 via email

@krobasky

krobasky commented May 3, 2018 via email

@webermn
Author

webermn commented May 3, 2018

@owhite / @clarisca / @krobasky:

Thanks for your interest. I agree that getting more information out on this will be useful.

Hopefully the comment from @saulakravitz provides some additional context about the Sequence Data Delivery Pilot and the associated tools on AWS and Google (including examples with TOPMed and 1000 Genomes data). We could also consider some or all of the following for the DCPPC:

  1. Determine for which group(s) an SDDP presentation/demo/discussion would be useful. (Perhaps full stacks and KC6 initially? Any others?)

  2. Have those who are interested review available docs and materials (see list below) in advance of a potential meeting and live demo

  3. Figure out a way to more broadly socialize aspects of data access and management; SDDP is one piece, but it may help to consider it alongside other approaches and to determine how to engage more than just those who subscribe to notifications on this issue list. (For one, I think it would be great to learn what others in the Consortium are already doing that could offer alternatives/improvements in this area.)

I hope this helps. I’m glad to find time to discuss further, and welcome thoughts on how to do that efficiently and with the right audience(s).

@krobasky

krobasky commented May 3, 2018

Two Questions:

  1. If I understand correctly, these tools are for working with FASTQs, but aren't the Full Stacks intended to work only with the VCFs for the TOPMed data? I ask because it changes the scale considerably; for example, I see a single study with 90 TB of run data.

  2. What does the following error mean?:

Following along from the slides, I downloaded fusera-linux-amd64 and gave it a try from my Data Commons-provisioned AWS VM. I found an NA12878 run that's hosted in an S3 bucket (e.g., DATAStore Location in the Run Selector = gs.US s3.us-east-1). The run is SRR944152, which I put in topmed.txt and ran:
./fusera-linux-amd64 mount --acc-file topmed.txt mnt
Should that work?
I got an error:

invalid arguments: gave location of s3.us-east-2, location must match one of these possibilities:

================
gs.[region]
================

regions for gs:
----------------

US

us-east1-b us-east1-c us-east1-d

us-east4-a us-east4-b us-east4-c

us-central1-a us-central1-b us-central1-c us-central1-f

us-west1-a us-west1-b us-west1-c

================
s3.[region]
================

regions for s3:
----------------

us-east-1

================
For accessing files on ncbi, use the location ftp-ncbi
================



starting fusera with given arguments failed, please review the help with -h
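
The error text above lists the location strings this fusera build accepted. As a rough illustration, a location can be checked against that list before the mount is attempted. This is only a sketch: the region sets are copied from the error output quoted here and may differ in other fusera versions, and `is_accepted_location` is a hypothetical helper, not part of fusera.

```python
# Region names copied from the fusera error output quoted above.
# Assumption: other fusera versions may accept a different set.
GS_REGIONS = {
    "US",
    "us-east1-b", "us-east1-c", "us-east1-d",
    "us-east4-a", "us-east4-b", "us-east4-c",
    "us-central1-a", "us-central1-b", "us-central1-c", "us-central1-f",
    "us-west1-a", "us-west1-b", "us-west1-c",
}
S3_REGIONS = {"us-east-1"}


def is_accepted_location(location: str) -> bool:
    """Hypothetical check: does `location` match a form listed in the error text?"""
    if location == "ftp-ncbi":
        return True
    platform, sep, region = location.partition(".")
    if not sep:
        # No "platform.region" separator, so it matches none of the listed forms.
        return False
    if platform == "gs":
        return region in GS_REGIONS
    if platform == "s3":
        return region in S3_REGIONS
    return False


print(is_accepted_location("s3.us-east-2"))  # False: the location from the error
print(is_accepted_location("s3.us-east-1"))  # True: the only s3 region listed
```

Under these assumptions, the rejected value `s3.us-east-2` fails because `us-east-1` is the only s3 region in the list.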

@saulakravitz

saulakravitz commented May 3, 2018 via email

@krobasky

krobasky commented May 3, 2018

Excellent, thorough answers, thank you!

Regarding 1) Having access to TOPMed FASTQs opens up a lot of possibilities. Meanwhile, I've tried analyzing data on FUSE-mounted S3 buckets and the mount always winds up disappearing; seemingly the I/O can't keep up. Has fusera been designed to overcome those challenges, or should we build accommodations into the analytical tools?

Regarding 2) 👍 I'm not sure how I wound up on us-east-2, but you're right - I've switched over to us-east-1 to try again - thanks!
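
For anyone else unsure which region their VM landed in, one way to check from inside an EC2 instance is the instance metadata service. The sketch below assumes IMDSv2 is reachable from the VM and that the availability zone name is the region plus a trailing zone letter (true for standard zones such as `us-east-2a`, not necessarily for Local Zones). The helper names are hypothetical, and the `s3.<region>` form is taken from the fusera error output earlier in this thread.

```python
import urllib.request

# EC2 instance metadata service base URL (link-local address).
IMDS = "http://169.254.169.254/latest"


def region_from_az(az: str) -> str:
    """Strip the trailing zone letter, e.g. 'us-east-2a' -> 'us-east-2'."""
    return az.rstrip("abcdefghijklmnopqrstuvwxyz")


def fusera_location(az: str) -> str:
    """Build the 's3.<region>' string in the form shown in the fusera error."""
    return "s3." + region_from_az(az)


def fetch_availability_zone(timeout: float = 2.0) -> str:
    """Read the availability zone via IMDSv2 (only works on an EC2 instance)."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()
    az_req = urllib.request.Request(
        f"{IMDS}/meta-data/placement/availability-zone",
        headers={"X-aws-ec2-metadata-token": token},
    )
    with urllib.request.urlopen(az_req, timeout=timeout) as resp:
        return resp.read().decode()
```

On the VM itself, `fusera_location(fetch_availability_zone())` would give the location string to compare against the regions fusera lists as accepted.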

So now it hangs... I don't mean to hijack this thread. Is there a GitHub issue tracker where I should log this? Either way, thanks for your help!

$ mkdir mnt
$ time ./fusera-linux-amd64 mount --acc-file topmed.txt mnt
^C
real    13m10.035s
user    0m0.100s
sys     0m0.000s
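
When reproducing a hang like this for a bug report, it can help to bound the wait instead of interrupting by hand. The sketch below wraps a command in a timeout using Python's standard `subprocess` module. `run_with_timeout` is a hypothetical helper, and the thread does not establish whether `fusera mount` is expected to stay in the foreground while the mount is being served, so a timeout here only shows that the command did not return within the limit.

```python
import subprocess


def run_with_timeout(cmd, seconds):
    """Run `cmd`; return its exit code, or None if it ran longer than `seconds`.

    subprocess.run kills the child process when the timeout expires.
    """
    try:
        completed = subprocess.run(cmd, timeout=seconds)
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode


# The mount command from this comment, with the same arguments as above.
FUSERA_CMD = ["./fusera-linux-amd64", "mount", "--acc-file", "topmed.txt", "mnt"]

# Example (not executed here, since it needs the fusera binary):
# result = run_with_timeout(FUSERA_CMD, 120)
# if result is None:
#     print("mount did not return within 120 seconds")
```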

@saulakravitz

saulakravitz commented May 4, 2018 via email
