<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta http-equiv="Content-Style-Type" content="text/css" />
<meta name="generator" content="pandoc" />
<meta name="author" content="Variant calling" />
<title>NGS data analysis course</title>
<style type="text/css">code{white-space: pre;}</style>
<link rel="stylesheet" href="../../../course_commons/css_template_for_examples.css" type="text/css" />
</head>
<body>
<div id="header">
<h1 class="title"><a href="http://ngscourse.github.io/">NGS data analysis course</a></h1>
<h2 class="author"><strong>Variant calling</strong></h2>
<h3 class="date"><em>(updated 28-02-2014)</em></h3>
</div>
<!-- COMMON LINKS HERE -->
<h1 id="preliminaries">Preliminaries</h1>
<h2 id="software-used-in-this-practical">Software used in this practical:</h2>
<ul>
<li><a href="http://bio-bwa.sourceforge.net/" title="BWA">BWA</a> : BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome.</li>
<li><a href="http://samtools.sourceforge.net/" title="samtools">SAMTools</a> : SAM Tools provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.</li>
<li><a href="http://picard.sourceforge.net/" title="Picard">Picard</a> : Picard comprises Java-based command-line utilities that manipulate SAM files, and a Java API (SAM-JDK) for creating new programs that read and write SAM files.</li>
<li><a href="http://www.broadinstitute.org/gatk/" title="GATK">GATK</a> (Genome Analysis Toolkit) : A package to analyze next-generation re-sequencing data, primarily focused on variant discovery and genotyping.</li>
</ul>
<h2 id="file-formats-explored">File formats explored:</h2>
<ul>
<li><a href="http://samtools.sourceforge.net/SAMv1.pdf">SAM</a></li>
<li><a href="http://www.broadinstitute.org/igv/bam">BAM</a></li>
<li>VCF (Variant Call Format): see the 1000 Genomes and <a href="https://en.wikipedia.org/wiki/Variant_Call_Format">Wikipedia</a> specifications.</li>
</ul>
<h1 id="some-previous-configurations">Some previous configurations</h1>
<p>To make running GATK easier, add an alias to your shell configuration:</p>
<pre><code>echo "alias GenomeAnalysisTK.jar='java -jar /home/Desktop/GenomeAnalysisTK-2.8-1-g932cd3a/GenomeAnalysisTK.jar'" >> ~/.bashrc</code></pre>
<p>Similarly, to make running the Picard tools easier:</p>
<pre><code>echo "alias CreateSequenceDictionary.jar='java -jar /home/Desktop/picard-tools-1.108/picard-tools-1.108/CreateSequenceDictionary.jar'" >> ~/.bashrc
echo "alias AddOrReplaceReadGroups.jar='java -jar /home/Desktop/picard-tools-1.108/picard-tools-1.108/AddOrReplaceReadGroups.jar'" >> ~/.bashrc
echo "alias MarkDuplicates.jar='java -jar /home/Desktop/picard-tools-1.108/picard-tools-1.108/MarkDuplicates.jar'" >> ~/.bashrc
echo "alias BuildBamIndex.jar='java -jar /home/Desktop/picard-tools-1.108/picard-tools-1.108/BuildBamIndex.jar'" >> ~/.bashrc</code></pre>
<h1 id="exercise-1-variant-calling-with-paired-end-data">Exercise 1: Variant calling with paired-end data</h1>
<p>Go to your course directory:</p>
<pre><code>cd &lt;my_course_directory&gt;</code></pre>
<p>Copy the data for the tutorials:</p>
<pre><code>cp -r /mounts/course_shares/Open_Share/variant_calling .
ll</code></pre>
<p>These datasets contain reads only for chromosome 21.</p>
<h2 id="prepare-reference-genome-generate-the-fasta-file-index">1. Prepare reference genome: generate the fasta file index</h2>
<p>Use <code>BWA</code> to index the reference genome:</p>
<pre><code>bwa index -a bwtsw f000-reference.fa</code></pre>
<p>where <code>-a bwtsw</code> specifies that we want to use the indexing algorithm that is capable of handling the whole human genome.</p>
<p>Use <code>SAMTools</code> to generate the fasta file index:</p>
<pre><code>samtools faidx f000-reference.fa</code></pre>
<p>This creates a file called <code>f000-reference.fa.fai</code>, with one record per line for each of the contigs in the FASTA reference file.</p>
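<p>You can quickly inspect the index to check that all expected contigs are present (an optional check, not part of the original protocol):</p>
<pre><code># one line per contig: NAME  LENGTH  OFFSET  LINEBASES  LINEWIDTH
head f000-reference.fa.fai</code></pre>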
<p>Generate the sequence dictionary using <code>Picard</code>:</p>
<pre><code>CreateSequenceDictionary.jar REFERENCE=f000-reference.fa OUTPUT=f000-reference.dict</code></pre>
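<p>The sequence dictionary is a small text file in SAM header format; listing it is a quick way to confirm that the reference was processed correctly (an optional sanity check):</p>
<pre><code># an @HD header line plus one @SQ line per contig with its name and length
cat f000-reference.dict</code></pre>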
<h2 id="prepare-bam-file">2. Prepare BAM file</h2>
<p>Go to the example1 folder:</p>
<pre><code>cd example1</code></pre>
<p>The <strong>read group</strong> information is key for downstream GATK functionality. The GATK will not work without a read group tag. Make sure to enter as much metadata as you know about your data in the read group fields provided. For more information about all the possible fields in the <span class="citation">@RG</span> tag, take a look at the SAM specification.</p>
<pre><code>AddOrReplaceReadGroups.jar I=f000-dna_100_high_pe.bam O=f010-dna_100_high_pe_fixRG.bam RGID=group1 RGLB=lib1 RGPL=illumina RGSM=sample1 RGPU=unit1</code></pre>
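<p>If you want to confirm that the read group was added, print the BAM header and look for the <code>@RG</code> line (an optional check using <code>samtools</code>):</p>
<pre><code># -H prints only the header; the @RG line should show the ID, LB, PL, SM and PU fields set above
samtools view -H f010-dna_100_high_pe_fixRG.bam | grep '^@RG'</code></pre>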
<p>We must sort and index the BAM file before processing it with Picard and GATK. To sort the BAM file we use <code>samtools</code>:</p>
<pre><code>samtools sort f010-dna_100_high_pe_fixRG.bam f020-dna_100_high_pe_fixRG_sorted</code></pre>
<p>Index the BAM file:</p>
<pre><code>samtools index f020-dna_100_high_pe_fixRG_sorted.bam</code></pre>
<h2 id="mark-duplicates-using-picard">3. Mark duplicates (using Picard)</h2>
<p>Run the following <strong>Picard</strong> command to mark duplicates:</p>
<pre><code>MarkDuplicates.jar INPUT=f020-dna_100_high_pe_fixRG_sorted.bam OUTPUT=f030-dna_100_high_pe_fixRG_sorted_noDup.bam METRICS_FILE=metrics.txt</code></pre>
<p>This creates a sorted BAM file called <code>f030-dna_100_high_pe_fixRG_sorted_noDup.bam</code> with the same content as the input file, except that any duplicate reads are marked as such. It also produces a metrics file called <code>metrics.txt</code> containing (can you guess?) metrics.</p>
<p><strong>QUESTION:</strong> How many reads are marked as duplicates (hint: view the on-screen output from the command)?</p>
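<p>If you prefer to count them directly from the BAM file, you can use <code>samtools</code> to count reads carrying the duplicate flag (0x400 = 1024); this is an optional check, not part of the original protocol:</p>
<pre><code># count reads flagged as duplicates in the marked BAM
samtools view -c -f 1024 f030-dna_100_high_pe_fixRG_sorted_noDup.bam</code></pre>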
<p>Run the following <strong>Picard</strong> command to index the new BAM file:</p>
<pre><code>BuildBamIndex.jar INPUT=f030-dna_100_high_pe_fixRG_sorted_noDup.bam</code></pre>
<h2 id="local-realignment-around-indels-using-gatk">4. Local realignment around INDELS (using GATK)</h2>
<p>There are two steps in the realignment process:</p>
<p><strong>First</strong>, create a target list of intervals which need to be realigned</p>
<pre><code>GenomeAnalysisTK.jar -T RealignerTargetCreator -R ../f000-reference.fa -I f030-dna_100_high_pe_fixRG_sorted_noDup.bam -o f040-indelRealigner.intervals</code></pre>
<p><strong>Second</strong>, perform realignment of the target intervals</p>
<pre><code>GenomeAnalysisTK.jar -T IndelRealigner -R ../f000-reference.fa -I f030-dna_100_high_pe_fixRG_sorted_noDup.bam -targetIntervals f040-indelRealigner.intervals \
-o f040-dna_100_high_pe_fixRG_sorted_noDup_realigned.bam </code></pre>
<p>This creates a file called <code>f040-dna_100_high_pe_fixRG_sorted_noDup_realigned.bam</code> containing all the original reads, but with better local alignments in the regions that were realigned.</p>
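<p>Realignment changes alignments but does not add or remove reads; as an optional sanity check you can compare the read counts of the input and output BAM files:</p>
<pre><code># both counts should be identical
samtools view -c f030-dna_100_high_pe_fixRG_sorted_noDup.bam
samtools view -c f040-dna_100_high_pe_fixRG_sorted_noDup_realigned.bam</code></pre>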
<h2 id="base-quality-score-recalibration-using-gatk">5. Base quality score recalibration (using GATK)</h2>
<p>Two steps:</p>
<p><strong>First</strong>, analyze patterns of covariation in the sequence dataset</p>
<pre><code>GenomeAnalysisTK.jar -T BaseRecalibrator -R ../f000-reference.fa -I f040-dna_100_high_pe_fixRG_sorted_noDup_realigned.bam -knownSites ../f000-dbSNP_chr21.vcf \
-o f050-recalibration_data.table</code></pre>
<p>This creates a GATKReport file called <code>f050-recalibration_data.table</code> containing several tables. These tables contain the covariation data that will be used in a later step to recalibrate the base qualities of your sequence data.</p>
<p>It is imperative that you provide the program with a set of <strong>known sites</strong>, otherwise it will refuse to run. The known sites are used to build the covariation model and estimate empirical base qualities. For details on what to do if there are no known sites available for your organism of study, please see the online GATK documentation.</p>
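<p>The recalibration table is plain text, so you can peek at its first lines to see the arguments and covariate tables it contains (an optional inspection step):</p>
<pre><code># the GATKReport starts with the arguments used, followed by the recalibration tables
head -n 20 f050-recalibration_data.table</code></pre>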
<p><strong>Second</strong>, apply the recalibration to your sequence data</p>
<pre><code>GenomeAnalysisTK.jar -T PrintReads -R ../f000-reference.fa -I f040-dna_100_high_pe_fixRG_sorted_noDup_realigned.bam -BQSR f050-recalibration_data.table \
-o f050-dna_100_high_pe_fixRG_sorted_noDup_realigned_recalibrated.bam</code></pre>
<p>This creates a file called <code>f050-dna_100_high_pe_fixRG_sorted_noDup_realigned_recalibrated.bam</code> containing all the original reads, but now with exquisitely accurate base substitution, insertion and deletion quality scores. By default, the original quality scores are discarded in order to keep the file size down. However, you have the option to retain them by adding the flag <code>--emit_original_quals</code> to the <code>PrintReads</code> command, in which case the original qualities will also be written in the file, tagged <code>OQ</code>.</p>
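<p>To see the effect of recalibration you can compare the base quality strings (the 11th column of a SAM record) of the same read before and after recalibration; this is an optional check, and the <code>OQ</code> tag will only be present if you used <code>--emit_original_quals</code>:</p>
<pre><code># print the first record of each BAM and compare column 11 (base qualities)
samtools view f040-dna_100_high_pe_fixRG_sorted_noDup_realigned.bam | head -n 1
samtools view f050-dna_100_high_pe_fixRG_sorted_noDup_realigned_recalibrated.bam | head -n 1</code></pre>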
<h2 id="variant-calling-using-gatk---unifiedgenotyper">6. Variant calling (using GATK - <strong>UnifiedGenotyper</strong>)</h2>
<p>SNPs and INDELs are called using separate commands.</p>
<p><strong>SNP calling</strong></p>
<pre><code>GenomeAnalysisTK.jar -T UnifiedGenotyper -R ../f000-reference.fa -I f050-dna_100_high_pe_fixRG_sorted_noDup_realigned_recalibrated.bam -glm SNP \
-o f060-dna_100_high_pe_snps.vcf</code></pre>
<p><strong>INDEL calling</strong></p>
<pre><code>GenomeAnalysisTK.jar -T UnifiedGenotyper -R ../f000-reference.fa -I f050-dna_100_high_pe_fixRG_sorted_noDup_realigned_recalibrated.bam -glm INDEL \
-o f060-dna_100_high_pe_indels.vcf</code></pre>
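<p>A quick way to see how many variants were called is to count the non-header lines of each VCF (an optional check; header lines start with <code>#</code>):</p>
<pre><code># number of SNP and indel records called
grep -vc '^#' f060-dna_100_high_pe_snps.vcf
grep -vc '^#' f060-dna_100_high_pe_indels.vcf</code></pre>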
<h2 id="introduce-filters-in-the-vcf-file">7. Introduce filters in the VCF file</h2>
<p>Example: filter SNPs with low calling confidence (QD &lt; 12.0) and flag them as “LowConf”.</p>
<pre><code>GenomeAnalysisTK.jar -T VariantFiltration -R ../f000-reference.fa -V f060-dna_100_high_pe_snps.vcf --filterExpression "QD &lt; 12.0" --filterName "LowConf" \
-o f070-dna_100_high_pe_snps_filtered.vcf</code></pre>
<p>The <code>--filterExpression</code> argument is evaluated against the INFO field of each variant. If a variant matches the filter expression, its FILTER field is set to the name given with <code>--filterName</code>; otherwise it is marked as passing. These arguments can be given several times with different filtering expressions (e.g. <code>--filterName One --filterExpression "X &lt; 1" --filterName Two --filterExpression "X &gt; 2"</code>), as in the sketch below.</p>
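<p>For example, a run combining the quality filter above with an additional coverage filter might look like the following sketch (the <code>DP &lt; 10</code> threshold and the output file name are illustrative, and assume your VCF contains a <code>DP</code> annotation in the INFO field):</p>
<pre><code># hypothetical example: two filter expressions applied in a single run
GenomeAnalysisTK.jar -T VariantFiltration -R ../f000-reference.fa -V f060-dna_100_high_pe_snps.vcf --filterExpression "QD &lt; 12.0" --filterName "LowConf" \
--filterExpression "DP &lt; 10" --filterName "LowCoverage" \
-o f071-dna_100_high_pe_snps_filtered2.vcf</code></pre>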
<p><strong>QUESTION:</strong> How many “LowConf” variants have we obtained?</p>
<pre><code>grep LowConf f070-dna_100_high_pe_snps_filtered.vcf | wc -l</code></pre>
</body>
</html>