Prepare data for Will's v2 Pipeline

Here's what to do!

Prerequisites

Running AWS Batch on an EC2 instance is the fastest way to analyze the dataset; see here.
The SRA Toolkit is needed to download your dataset. You can either download it from here or just run ./setup-yum.sh (this is the option labeled "Cloud - yum install script" in the tutorial).

Go to NCBI, enter your BioProject, and go to the dataset.
Click on "SRA" under "Related information" on the right side of the page.
You should see a pop-up above "Links from BioProject" in the center of the page that says "View results as an expanded interactive table using the RunSelector. Send results to RunSelector". Click on "Send to RunSelector".
You should see a table with all the runs. Click the button labeled "Metadata" to download the metadata for all runs (the file should be called something like SraRunTable.csv). Then copy it to your EC2 instance using scp (for example, scp -i [path to your key] ~/Downloads/SraRunTable.csv [ec2 instance name]:/home/ec2-user/).
Run ./create_library_and_accession.sh to create libraries.csv (note we're using SRA for both library and sample) and accession_list.txt (this is used to download the dataset).
Run ./download_fastq.sh to download the dataset into your S3 bucket.
Update paths in nextflow.config, and you're good to go.
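
If you want to sanity-check accession_list.txt, the list of run accessions can be approximated by hand from the metadata file. The snippet below is a minimal sketch and not the contents of create_library_and_accession.sh; the function name extract_accessions is hypothetical, and it assumes the run accession ("Run") is the first column of SraRunTable.csv, which you should confirm by looking at the header row of your file.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of this repo): print the first column of a
# Run Selector metadata CSV, one accession per line.
# Assumption: the run accession ("Run") is the first column, so commas inside
# quoted fields later in the row do not affect the result.
extract_accessions() {
    local csv="$1"
    # Skip the header row, keep field 1, strip Windows line endings,
    # and drop any blank lines.
    tail -n +2 "$csv" | cut -d',' -f1 | tr -d '\r' | grep -v '^$'
}
```

For example, extract_accessions SraRunTable.csv | diff - accession_list.txt should print nothing if the two lists agree and accession_list.txt holds one accession per line in the same order.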
