Provide raw data and scripts #2

Open
jszinger opened this issue Apr 13, 2020 · 4 comments

Comments

@jszinger

Please provide links to the raw data and the scripts necessary to format them for JBrowse. I would like to set up my own instance using data of known provenance and a proven chain of custody. Pulling preformatted data from the cloud does not meet this requirement.

@scottcain
Member

Hi @jszinger ,

I'm guessing you're referring to the SRA tracks: the raw data for each track can be obtained by following the link in the "about this track" dialog box, which you get by mousing over the track label and clicking the menu's down triangle. The "about" box also links to the analysis performed by the Galaxy people, from whom I got the VCF files. That URL points to the top level of the analysis repo, but you can get more information about the variant analysis by digging a few directories down to the variant readme: https://github.com/galaxyproject/SARS-CoV-2/blob/master/genomics/4-Variation/README.md

The only things I did to the VCF files after getting them from the Galaxy folks were to change the name of the reference sequence (in Galaxy they used "NC_045512"; in JBrowse I used "NC_045512.2") and then filter out variants with a frequency of less than 1%, which I did with a simple Perl one-liner: perl -ni.bak -e 'if ($_=~/^NC_045512.2/ and $_=~/AF=0.00/) {next;} else {print;}' *.vcf After that I bgzipped and tabix-indexed them so JBrowse could read them.

Is that what you're looking for?
Scott

@jszinger
Author

I'm actually asking about the other tracks: CDS, Genes, primers and multi alignment. For example, there's a bunch of processing that needs to happen to https://www.ncbi.nlm.nih.gov/nuccore/NC_045512 before it can be displayed by JBrowse. I wish to know the details of retrieval and processing.

Thanks,
Jim

@scottcain
Member

Ah, OK. The data processing for that is relatively straightforward. It requires getting the FASTA and GFF3 files for NC_045512 from the page you linked to: click the "send to" link, select "complete record", choose "file" as the destination, and then select FASTA and GFF3 from the format drop-down menu (one download for each format).
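Since the request is about provenance, it may help to record a checksum of each file at the moment it is downloaded, so that later copies can be compared against it. The following is only a sketch of that idea using Python's standard hashlib module; the function names and the manifest layout are hypothetical and nothing like this was part of the original processing.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_lines(paths):
    """One 'digest  filename' line per file, in the order given."""
    return [f"{sha256_of(path)}  {Path(path).name}" for path in paths]
```

Calling manifest_lines on the downloaded FASTA and GFF3 files and saving the result next to them gives a simple record to check against whenever the files are copied or reprocessed.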

Once you have the files, first run bin/prepare-refseqs.pl --fasta <name of fasta file> in the jbrowse directory (which you get either by downloading from jbrowse.org or by doing a git clone https://github.com/GMOD/jbrowse.git and following the build instructions, which basically boil down to running ./setup.sh). That creates the "reference sequence" track in JBrowse.

Then you can process data from the GFF3 file to get tracks for genes and CDS. The command generally looks like bin/flatfile-to-json.pl --gff <filename> --type <gene or CDS> --trackType CanvasFeatures --key <genes or CDS> --trackLabel <genes or CDS> This command generates a set of JSON files that JBrowse uses to display the gene and CDS tracks. Display changes I made to the defaults (like stealing the color scheme for the CDS features from NextStrain) are encoded in the trackList.json file. (The trackList.json file is created when you run the prepare-refseqs script and is added to when you run flatfile-to-json.)
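The two steps above can be written down as a small, repeatable script. This Python sketch only assembles the command lines described in this comment; the file names are placeholders, the function name is hypothetical, and the track label is passed as --trackLabel, which is how the JBrowse 1 documentation spells that option.

```python
import shlex


def jbrowse_commands(fasta, gff, tracks=(("gene", "genes"), ("CDS", "CDS"))):
    """Return the commands, in order, to build the reference, gene, and CDS tracks.

    `tracks` pairs a GFF3 feature type with the name used for --key and --trackLabel.
    """
    commands = [["bin/prepare-refseqs.pl", "--fasta", fasta]]
    for feature_type, name in tracks:
        commands.append([
            "bin/flatfile-to-json.pl",
            "--gff", gff,
            "--type", feature_type,
            "--trackType", "CanvasFeatures",
            "--key", name,
            "--trackLabel", name,
        ])
    return commands


if __name__ == "__main__":
    # Placeholder file names; substitute the files downloaded from NCBI.
    for command in jbrowse_commands("NC_045512.2.fasta", "NC_045512.2.gff3"):
        print(shlex.join(command))
```

Each list can be handed to subprocess.run(command, cwd=<jbrowse directory>, check=True) so that a failed step stops the run; keeping the script alongside the input files documents exactly how the tracks were built.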

The primer tracks came from me "scraping" the primer sequences from the linked resources and using "Add sequence search track" for each primer sequence to identify its coordinates. I then wrote a GFF3 file by hand and processed it with flatfile-to-json, similar to the above. The primers.gff file in this repo is the result of those searches.
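The coordinate lookup was done by hand in the JBrowse interface, but the same idea can be scripted. Below is a minimal sketch, assuming exact matches only: it searches the reference for a primer on both strands and emits GFF3 lines. The function names, the source column value, and the primer_binding_site feature type are assumptions and may not match what primers.gff in this repo uses.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def primer_to_gff3(reference, primer, name, seqid="NC_045512.2",
                   source="primer_search", feature_type="primer_binding_site"):
    """Return one GFF3 line per exact match of the primer on either strand.

    `name` is written as-is into the Name attribute, so it should not contain
    characters that GFF3 reserves (tab, ';', '=', '&', ',').
    """
    ref = reference.upper()
    query = primer.upper()
    records = []
    for strand, pattern in (("+", query), ("-", reverse_complement(query))):
        if strand == "-" and pattern == query:
            continue  # palindromic primer: already reported on the + strand
        start = ref.find(pattern)
        while start != -1:
            # GFF3 coordinates are 1-based and inclusive.
            records.append("\t".join([
                seqid, source, feature_type,
                str(start + 1), str(start + len(pattern)),
                ".", strand, ".", f"Name={name}",
            ]))
            start = ref.find(pattern, start + 1)
    return records
```

Writing the lines under a "##gff-version 3" header gives a file that can be loaded with flatfile-to-json in the same way as the gene and CDS tracks.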

The multialignment track I know a little bit less about: the BED file I used was created by @cmdcolin and I just grabbed the data. I know that it was fairly straightforward: the data for all SARS-CoV-2 sequences were obtained from GenBank, downloaded as a multialignment FASTA file, and then processed into a BED file that was tabix-indexed. Yes, that feels a little hand-wavy; perhaps @cmdcolin can fill in a little bit of detail if you like. I added the track configuration for this track to the trackList.json file by hand.

This is a fairly brief overview but should do the job of letting you know how the data were processed. If you want to do something similar, please feel free to email the JBrowse mailing list at [email protected] or hit us up in Gitter: https://gitter.im/GMOD/jbrowse

@scottcain
Member

@jszinger ,

If you feel like the above descriptions are adequate, let me know and I can add them to the "about this track" for each track where it makes sense.
