(Or you can go back to the Readme page here).
Your statistical power, precision, and accuracy will depend on the quantity and quality of your data. These factors, of course, depend on the starting material you use for sequencing. Excitingly, a bunch of recent studies suggest that RRGS can be used on sub-optimal samples, for example from museum specimens, fecal extractions, and the like. Nonetheless, it is in your interest to use as high quality DNA as possible.
- For animal tissue, I recommend using ethanol or RNAlater to preserve your tissues and I suggest chilling your samples (at -20 or -80 degrees) as soon as possible after collection.
- For DNA extraction, I've had success using Qiagen DNEasy extraction kits. These kits can be used in any lab that has a heat block and a centrifuge. When a small amount of starting material is being used I recommend reducing the volume of elution buffer you use in order to concentrate the DNA. You can always dilute it later, and it is harder and less efficient to make a sample more concentrated.
- If you want to outsource the library preparation (I always have), I recommend using an agarose gel to normalize the concentration of your samples and ensure that the quality is as good as possible. Different methods have different requirements in terms of the volume and concentration of gDNA, but most expect these parameters to be uniform across the samples you submit.
- If you have many samples, I suggest extracting and comparing more than you plan to run so you can choose the best ones.
I've done RADseq multiple times at Floragenix, which is located in Oregon, USA. I've also done Genotype by Sequencing at Cornell University in New York, USA. In 2015, the prices for a 95 RADseq sample run, including library preparation and one lane of single end Illumina sequencing but no bioinformatics was US$7725. In 2015, the price of a 95 GBS run with two single end lanes of Illumina sequencing was $6,080 (considerably less expensive). I think RADseq services are also available at the University of Edinburgh. I anticipate that the cost of these services will decline considerably over the next few years and that more sequencing centers will offer this service.
This website is written in a markup language called Markdown and hosted by Github. I've found both of these tools to be easy to learn and very useful.
Illumina sequence data is provided in a text file that is in a format called fastq
. This is a modification of another format called fasta
in which each sequence has a header that begins with a >
sign. This is followed by the sequences. Here is an example:
>example_sequence_in_fasta_format`
ATGCGCGCGCTAGGCTCGCGATCGGGGAGCGCGAGCTGAGCTAGCGCGATGCGCCCCGAC
The format of fastq
files is similar to fasta
except that quality scores are included. Each sequence has four lines (instead of two for fasta
files). The first begins with @
followed by information about the sequence. The second line is the nucleotide sequence. The third line is a +
which may be followed by the same information that followed the @
sign in the first line. The fourth line is the quality values. For the Illumina data we will be working with, these values range from 0–41 and are represented by single characters. More details about fastq format is available here. Here is an example of a sequence in fastq format:
@HWI-ST724:202:D127MACXX:6:2114:13665:74490
TGCAGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGAAATTTTTGGGCACAAAGAACCACAGAAAAAAAATGAAAA
+HWI-ST724:202:D127MACXX:6:2114:13665:74490
AFHJJJFIJJJJIJJJJJHFDDDDDB0530&0)00&)0&05007BDD############################################
In this sequence the number signs indicate low quality reads at the end (right side) of the sequence.
The data we will be working witb are single end 100 bp reads from one Illumina lane. The data are from 9 individuals that were barcoded and multiplexed on this lane (see below for more explanation). The path to the complete dataset is:
/1/scratch/monkey_data/forward.fastq
Please use the ls
command to find out how large this file is.
As you (hopefully) can see, this is a LARGE file. Because the tasks we will perform take a while with this much data, I made a smaller dataset to work with here:
/1/scratch/BIO720_Bens_section/subset_data/forward_subset.fastq
In case you are interested, I made this using the unix cat
and awk
commands as follows:
cat forward.fastq | awk 'NR >= 0 && NR <= 500000 { print }' > forward_subset.fastq
Here the cat
command pipes the file called 'forward.fastqto the
awk command. Then the awk
command searches the number of records NR
(i.e. the line numbers) from 0-500,000 and prints them to a file called forward_subset.fastq
.
FYI, as with most things, I did not figure this out myself, I found it on the internet somewhere.
Before we do anything with individual sequences, it is a good idea to survey the overall quality of the data. We can do this with many free tools; for this class we will use a program called FastQC. To run this program please type this:
/usr/local/fastqc/fastqc -o output_directory datafile
where output_directory
is the directory to which you want to write the output and datafile
is the example dataset listed above (or whereever your data are). This should give you some feedback about the analysis as it runs and generate a file called datafile.zip
in the output directory.
Here is an example of a commandline I ran:
/usr/local/fastqc/fastqc -o /home/evanslab /1/scratch/BIO720_Bens_section/subset_data/forward_subset.fastq
(You would have to change the evanslab
part to match your home directory because you don't have write permissions to my directory). This command is equivalent to this:
fastqc -o ~ /1/scratch/BIO720_Bens_section/subset_data/forward_subset.fastq
The latter command works because the program fastqc
is already in your $PATH variable (thanks to Brian) and because the ~
is a shortcut for your home directory.
Please download this .zip
file to your computer, uncompress it, and open the .html
file in a browser. You should see the quality control plots that were generated that Brian went over in an earlier class.
Most RRGS methods rely on the Illumina sequencing platform. These machines generate data using something called a "flowcell" that is divided up into eight "lanes". Small scale projects typically would run multiple samples (from different species or different individuals within a species) on one lane. Because the sequence methodology requires the ligation (attachment) of a linker (a bit of DNA) to each side of bits of DNA that will be sequenced, it is straightforward to combine multiple samples (multiplex) from different individuals in a single lane. This is done by adding a unique identifier sequence (a barcode) to the linker that is used on each sample. Note that this barcode is different from "DNA barcoding", the latter of which generally refers to the use of a small variable genomic region (such as the COI gene for animals) for species and population identification.
A first step in our analysis pipeline is to organize data from each of our samples that were run together on an Illumina lane (De-multiplexing our data) and also to filter our data and trim off bits that have lots of errors or that have sequences from the laboratory procedures that were used to generate the data (Trimming/Quality control). To begin please make sure you are in your home directory by typing this:
cd ~
When samples are run on an Illumina machine, DNA is broken up into many small fragments and a small bit of DNA called an adaptor is then added on each of the fragments. This adaptor allows the sequencing process to occur, essentially by making possible high-throughput put polymerase chain reaction (ask Ben about this if you are unfamiliar). To make possible the multiplexing of samples on one Illumina lane, each sample is linked to a unique adaptor that contains a "barcode" sequence that allows us to sort out which samples each sequence came from. For our dataset, we have nine individuals from one species (the Tonkean macaque). Each of the samples received the following barcodes:
CCTCTTATCA
TATCGTTAGT
TAGTGCGGTC
GGCCGGTAAC
AGGAACCTCG
TTATCCGTAG
CGCTATACGG
CACGCAACGA
ATCCGTCTAC
These barcode correspond with the following sample names:
CCTCTTATCA PF515
TATCGTTAGT PM561
TAGTGCGGTC PM565
GGCCGGTAAC PM566
AGGAACCTCG PM567
TTATCCGTAG PM582
CGCTATACGG PM584
CACGCAACGA PM592
ATCCGTCTAC PM602
We will use this information in a moment to de-multiplex our data. Please use your favourite text editor to generate a file in your home directory (cd ~
) called monkey.barcodes
and paste in only the barcode sequences (without the sample names).
A first step in analysis of Illumina data is to identify adaptor and barcode sequences in our data, sort sequneces by the barcode, and remove the adaptor and barcode sequences from the data. We can also get rid of sequences that have ambiguous barcodes due to sequencing errors. We additionally can get rid of sequences with low quality scores and trim them all so they have the same length (this last step would not normally be done for RNAseq data but it is a reasonable thing to do for RADseq data).
Illumina generates sequences that have errors in base calls. Errors typically become more common towards the end of the sequence read, and sometimes (but not always) an "N" is inserted in positions where the base pair is difficult to call. But sometimes it makes an incorrect call as well.
We will use software package called Stacks
to de-multiplex and trim our data. This is actually a suite of programs and we will be using the application called process_radtags
within Stacks
. Stacks
has a very nice online manual here. FYI, other software that does trimming of RADseq data is available here.
Brian has installed most of the software we need in a directory called /usr/local/bin
. Before we de-multiplex our data subset, we need to make a directory for the de-multiplexed data to be stored in. From your home directory (cd ~
), please type this:
mkdir samples
The command to execute this program on our data is:
/usr/local/bin/process_radtags -f <inputfile> -b <barcode_file> -o ./samples/ -e sbfI -t 75 -r -c -q --adapter_1 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG --adapter_2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT --adapter_mm 2 --filter_illumina
As detailed in the online manual, the first part (/usr/local/bin/process_radtags
) directs the computer to run the program process_radtags, which is in the director called /usr/local/bin/
. The -f
flag specifies where the data are and provides the path and filename of the data. The -b
flag specifies where the barcode file is that we made eariler and <barcode_file> provides the path and name for this file. The -o
flag tells process_radtags
to put the demultiplexed data in a folder called ./samples
, which we should make in advance using the unix mkdir
command. The -e
flag tells process_radtags
that the restriction enzyme called sbfI was used to generate the data. The -t
flag tells process_radtags
to trim all sequences to be 75 base pairs long. The -r
, -c
, and -q
flags directs process_radtags
to respectively
- rescue barcodes and RADtags when possible allowing upto a default value of 2 mismatches (this number can be changed too if you want)
- clean the data and remove reads with any uncalled bases, and
- discard reads with low quality scores.
The other flags tell process_radtags
to remove adapter sequences that are specified and to remove bad reads recognized by the Illumina sequencing software. Other details, such as the type of quality scores are set at default values. All of this information is available, of course, in the manual that comes with the program.
Here is an example of the commandline I used to de-multiplex the subset of the data:
/usr/local/bin/process_radtags -f /1/scratch/BIO720_Bens_section/subset_data/forward_subset.fastq -b monkey.barcodes -o ./samples/ -e sbfI -t 75 -r -c -q --adapter_1 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG --adapter_2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT --adapter_mm 2 --filter_illumina
If you enter the samples
directory, you should see your de-multiplexed files, each named by the barcode. If you want to, you can rename these to have the sample name instead using the mv
command. For example:
mv sample_CCTCTTATCA.fq PF515.fq
You should also see a log file that gives some statistics about the output from process_radtags
As an exercise, please use the grep
command to count how many reads we have for each individual. A hint is that using grep
, you can count the number of times an identifier character for each sequence appears in each file for each individual. Another hint is that you can get the manual for any Unix
command by typing man command
. Which individual has the most reads? Which has the least reads? Can you think of a reason that some samples have lots of reads while others have less? You should be able to confirm your read count with the number in the log file from process_radtags
.
How would your grep
command differ for a fasta
file compared to a fastq
file?
Please use FastQC to evaluate read quality of your trimmed sequences. How do the trimmed reads differ from the untrimmed reads?
After class, please use process_radtags
to demultiplex the full dataset located here:
/1/scratch/monkey_data/forward.fastq