Prepare data for Will's v2 Pipeline
Here's what to do!
- AWS Batch on an EC2 instance is the fastest way to analyze the dataset; see here.
- The SRA Toolkit is needed to download your dataset. You can either download it from here or just run `./setup-yum.sh` (this is the option labeled "Cloud - yum install script" in the tutorial).
- Go to NCBI, enter your BioProject, and go to the dataset.
- Click on "SRA" under 'Related information' on the right side of the page.
- You should see a pop-up above 'Links from BioProject' in the center that says "View results as an expanded interactive table using the RunSelector. Send results to RunSelector". Click on "Send results to RunSelector".
- You should see a table with all the runs. Click the button labeled "Metadata" to download all the metadata (the file should be called something like `SraRunTable.csv`). Then copy it to your EC2 instance using scp (for example, `scp -i [location to your key] ~/Downloads/SraRunTable.csv [ec2 instance name]:/home/ec2-user/`).
- Run `./create_library_and_accession.sh` to create `libraries.csv` (note we're using SRA for both library and sample) and `accession_list.txt` (this is used to download the dataset).
- Run `./download_fastq.sh` to download the dataset into your S3 bucket.
- Update the paths in `nextflow.config`, and you're good to go.
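
For orientation, here is a rough sketch of what the two helper scripts might be doing. It is not the pipeline's actual code. The function names `make_lists` and `download_to_s3` are hypothetical, and several things are assumed: the run accession is the first column of the metadata file, `libraries.csv` has a `library,sample` header, and the download uses `fasterq-dump` from the SRA Toolkit plus the AWS CLI. Check the real scripts before relying on any of this.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; the pipeline's own scripts may work differently.

# make_lists METADATA_CSV OUT_DIR
# Assumption: the run accession is the first column of the metadata file
# (below a header row) and contains no embedded commas.
make_lists() {
  local metadata="$1" out_dir="$2"
  # Skip the header, strip quotes and carriage returns from the first field,
  # then drop blank values and duplicates while keeping the original order.
  awk -F, 'NR > 1 { gsub(/[\r"]/, "", $1); if ($1 != "" && !seen[$1]++) print $1 }' \
    "$metadata" > "$out_dir/accession_list.txt"
  # Assumed layout: a "library,sample" header, with the accession in both columns.
  {
    echo "library,sample"
    awk '{ print $0 "," $0 }' "$out_dir/accession_list.txt"
  } > "$out_dir/libraries.csv"
}

# download_to_s3 ACCESSION_LIST S3_PREFIX
# Assumption: fasterq-dump and the AWS CLI are installed and configured.
download_to_s3() {
  local list="$1" s3_prefix="$2" acc workdir
  # Read the list on file descriptor 3 so the tools cannot consume it via stdin.
  while IFS= read -r acc <&3; do
    [ -n "$acc" ] || continue
    workdir="$(mktemp -d)"
    # Write the FASTQ files for this run into a scratch directory.
    fasterq-dump "$acc" --split-files --outdir "$workdir"
    # Upload everything produced for this run, then clean up.
    aws s3 cp "$workdir" "$s3_prefix/$acc/" --recursive
    rm -rf "$workdir"
  done 3< "$list"
}
```

After sourcing the file, a run might look like `make_lists SraRunTable.csv .` followed by `download_to_s3 accession_list.txt s3://[your bucket]/[prefix]`, where the bucket and prefix are placeholders for your own values. Splitting on commas is only safe for the first column; metadata fields further right can contain quoted commas and would need a proper CSV parser.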