**Lieberman Lab SNP Analysis Pipeline**
**Brief Summary:**
This document describes the Lieberman Lab data analysis pipeline that determines evolutionary relationships within a set of closely related isolates of the same species. The pipeline detects SNPs by default and has modules for detecting regions of differential coverage (MEDs = mobile element differences). The pipeline includes three main steps:
1. Mapping Step: This step takes raw sequencing data files for each sample and performs an alignment to a reference genome.
2. Case Step: This step takes the alignments across all samples and produces a table of candidate SNPs.
3. Analysis Step: This step takes a closer look at candidate SNPs and can produce a variety of outputs, including a phylogenetic tree of the samples based on a filtered list of SNPs. There are many useful functions in the scripts folder that can be added to this local step.
There is one main Snakemake file associated with all three steps; it calls command line tools and other custom Python scripts to carry out various functions. When using a computer cluster, Snakemake can dispatch batches of jobs to the cluster as well.
A handy feature of this pipeline is that it looks for existing files before trying to regenerate them. If you add new samples to the list of samples to be processed in the mapping step, you don't need to remove already-processed samples from the list. However, this can cause errors with tools that write files with 'final' names before they are complete (or even after an error), because the pipeline will treat those files as finished.
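If one of these prematurely named files sneaks through, the usual fix is to delete the suspect output and rerun; Snakemake will then regenerate it and everything downstream of it. A minimal sketch (the sample, alignment, and file names here are hypothetical placeholders):
rm mySample/sickle2050/Pacnes_C1/variant.vcf (delete the suspect 'final' file)
snakemake -n (dry run: shows what Snakemake now plans to regenerate, without running anything)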
**Pipeline Etiquette**
Most of the Python scripts you will use as part of this pipeline are shared with other lab members. If you make an improvement to a script, you should log the change at the top of the script so that everyone can refer to the revision history. If you make a significant change, please also write a brief message in the Slack channel to alert other lab members.
An example of a revision history in the form of a Python comment is below. If this section does not already exist, go ahead and add it yourself.
# Revision history:
## YYYY.MM.DD YourName: Explain the most recent revision you made and why you did it.
## YYYY.MM.DD Name: Keep previous entries here so that everyone can see the revision history.
#
In the future, we will likely transition to a better version control system, like GitHub.
**Errors and debugging**
When running Snakemake, the primary places to look to monitor progress and identify possible errors are mastererr.txt and masterout.txt, which capture the output from Snakemake itself. After that, you can look at the .err and .out messages for the individual rules that failed (in the logs folder).
**Common errors**
- Missing or truncated files: one of the most common observations is that a rule fails because the input file is technically there, but empty (small file size) or truncated. An empty file could emerge because (a) there is truly no data to report in the output from a step (e.g. all reads failed a quality filtering step) or, more likely, because (b) the job failed due to an error on the compute node unrelated to your code or the particular rule, or (c) there is an error in the formulation of the rule, but the rule still outputs an empty file before throwing an error.
- Random failure of jobs: the .err file doesn't even have the Snakemake command listed, and the .out file is empty because nothing ran. This will happen if a node is malfunctioning. Check whether all of your failed jobs ran on the same node by running sacct in a way that shows you the node list:
sacct -u YOUR_USER_NAME --format="JobID,JobName%30,State,nodelist,cluster,elapsed,start,end" | grep FAILED
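To scan all sample directories for empty output files in one pass, something like the following works (the path pattern is a hypothetical example based on the mapping step's output layout):
find . -path "*/sickle2050/*" -type f -size 0 (list zero-size output files under the sample folders)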
**Mapping Step**
**Summary: **This step takes raw data files for each sample and aligns them to a reference genome by performing the following tasks on each sample:
1. Trimming adapter sequences: Removes adapter sequences on either side of each read
2. Filtering reads: Removes reads with poor quality
3. Alignment: Aligns reads to a reference genome
Executables and dependencies needed to perform each rule are stored in conda environments.
Each rule generates log files that store errors along with standard output.
**Key code:**
1. cutadapt -a CTGTCTCTTAT -o {output.fq1o} {input.fq1} > {log.log1} 2>&1\
cutadapt -a CTGTCTCTTAT -o {output.fq2o} {input.fq2} > {log.log1} 2>&1
2. sickle pe -f {input.fq1} -r {input.fq2} -t sanger -o {output.fq1o} -p {output.fq2o} -s {output.fqSo} -q 20 -l 50 -x -n > {log} 2>&1
3. bowtie2 -X 2000 --no-mixed --dovetail --very-sensitive --n-ceil 0,0.01 --un-conc unaligned.fastq --phred64 -x genome_bowtie2 -1 filter_reads_1.fastq -2 filter_reads_2.fastq -S aligned.sam
4. samtools view -bS -o aligned.bam aligned.sam
samtools sort aligned.bam aligned.sorted
5. samtools mpileup -q30 -x -s -O -d3000 -f genome.fasta aligned.sorted.bam > strain.pileup
6. samtools mpileup -q30 -t SP -d3000 -uvf genome.fasta aligned.sorted.bam > strain
bcftools call -c -Ov -o strain.vcf strain
bcftools view -v snps -q .75 strain.vcf > variant.vcf
Code interpretation:
1. cutadapt trims the adapters from both sides of the read; -a = specifies the adapter sequence. Output file stored as fq.gz format.
2. sickle uses a sliding window to assess quality along the read, and trims off ends that are below the desired quality threshold; pe = paired-end mode; -f = forward input file; -r = reverse input file; -t = quality encoding; -o = forward output file; -p = reverse output file; -s = output file for single reads whose mate was discarded; -q = quality threshold; -l = discard trimmed reads smaller than this length; -x = do not trim from 5' end (trim from 3' end only); -n = truncate at the first N (and everything after it). Takes input fq.gz files and outputs trimmed fq.gz files. Reads are discarded if below the length threshold.
3. bowtie2 attempts to align pairs of fwd+rev reads to the reference genome; -X = maximum fragment length for a valid paired-end alignment; --no-mixed = suppress alignments for individual mates; --dovetail = allows mates to extend past each other; --very-sensitive = preset option for multiseed alignment parameters; --n-ceil = a function giving the maximum number of ambiguous characters (N's) allowed in a read (here, 0 + 0.01 * read length); --un-conc = name of file to record reads that do not align concordantly; --phred64 = specifies encoding of base qualities; -x = reference genome file; -1 = forward read file; -2 = reverse read file; -S = output SAM file
4. samtools compresses the sam file into a bam file and then sorts the aligned reads by position on the reference genome
5. samtools aggregates data across all aligned reads for each position on the reference genome; -q = minimum mapping quality for an alignment to be used; -s = instruction to record mapping quality in the output file; -O = output base positions on reads; -d = maximum number of reads at a given position to include (skips reads over this coverage threshold); -f = reference genome file (the sorted bam is passed as the final argument); > = output file
6. samtools creates an intermediate file that combines the reference genome with the read alignments; bcftools then calls and filters variants; -c = use samtools/bcftools' original calling method; -Ov = output in uncompressed VCF format; -o = name of output file; -v snps = keep only variant sites that are SNPs; -q = minimum allele frequency of sites to be printed; > = output file
Note that this pipeline does NOT delete duplicate reads.
The > {log} 2>&1 code stores both the standard output of a command and any error messages in the same log file.
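For example, outside of Snakemake, the redirection pattern in the first cutadapt command expands to something like this (file and log names are hypothetical placeholders):
cutadapt -a CTGTCTCTTAT -o mySample_trimmed_1.fq.gz mySample_1.fq.gz > logs/cutadapt_mySample.txt 2>&1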
Command line tool references:
1. cutadapt: <https://cutadapt.readthedocs.io/en/stable/>
2. sickle: <https://sickle.readthedocs.io/en/latest/> and <https://github.com/najoshi/sickle>
3. bowtie2: <http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml>
4. samtools: <https://www.htslib.org/doc/samtools.html>
5. bcftools: <https://samtools.github.io/bcftools/howtos/index.html> and <https://samtools.github.io/bcftools/bcftools.html>
**How to run this step on c3ddb:**
There are four files and two folders [VK1] that you need to run the mapping step [VK2]:
1. samples.csv: a table listing all of your sample names (one per row), where to find the raw data files, and how you want to perform the alignment; the column headers are [HQD3]:
a. 'Batch': Directory where the raw fastq files are
b. 'Lane': don't change (no longer in use)
c. 'Barcode': don't change (no longer in use)
d. 'Sample': Unique identifier for each sample
e. 'Alignment': Specify the name of the alignment you want to do (the name should match one of the rows in alignment_params.csv and is often just the name of your reference genome, ex. Pacnes_C1)
i. If you want to do multiple alignments, list them with spaces in between.
f. 'ProviderName': the shared name of the raw data files (fwd/rev fastq file name without _1.fastq or _2.fastq at the end)
You will need to update this file (a hypothetical example is sketched after this list).
2. Snakefile: the Snakefile script for the mapping step; this script calls other Python scripts and command line tools and can send additional jobs to the cluster.
3. snakemakeslurm.sh: instructions for the slurm cluster to run the Snakefile
4. cluster.slurm.json: this file contains instructions for the cluster and parameters for job submission
5. envs: this folder includes all of the conda environments called within the Snakefile
6. scripts: this folder includes all Python scripts called within the Snakefile
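For reference, a hypothetical samples.csv might look like the sketch below (all values are invented placeholders; per comment [HQD3] at the end of this document, double-check these headers against a current samples.csv before copying them):
Batch,Lane,Barcode,Sample,Alignment,ProviderName
/scratch/mit_lieberman/raw_data/my_run/,1,1,mySample1,Pacnes_C1,mySample1_S1
/scratch/mit_lieberman/raw_data/my_run/,1,1,mySample2,Pacnes_C1,mySample2_S2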
Now you're ready to run the mapping step on the cluster.
1. Log into c3ddb:
a. If you don't have an account, request one here: <http://www.mghpcc.org/resources/computer-systems-at-the-mghpcc/c3ddb/accounts/> . Once you've filled in the request_account form and downloaded the c3ddb folder, remember to email your public key to [email protected]
b. Use ssh to log into the cluster through the terminal:\
ssh -i <path_to_your_private_key_file> -l <your_c3ddb_username> c3ddb01.mit.edu
c. Enter your passcode when prompted.
2. Warning for new users: Your default quota on c3ddb is very small. If your jobs fail or get cancelled, but there aren't any errors reported, then you may need to increase your quota. Here is how you can check your quota:
srun lfs quota -u <your_c3ddb_username> /scratch
3. Make a directory in the Lieberman Lab folder on c3ddb to hold all the files used and created during the mapping step:
a. Example format:\
/scratch/mit_lieberman/YYYY_MM_DD_myName_myProject
b. Here are some terminal commands that may come in handy:\
cd (change directory)\
mkdir (make directory)
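c. For example (using the directory format above):\
$ mkdir /scratch/mit_lieberman/YYYY_MM_DD_myName_myProject\
$ cd /scratch/mit_lieberman/YYYY_MM_DD_myName_myProject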
4. Upload the four files and two folders listed above to the directory you just created.
a. Cyberduck (<https://cyberduck.io/>) and FileZilla (<https://filezilla-project.org/>) are tools you can use to upload/download/view/edit files on c3ddb.
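b. If you prefer the terminal, scp also works; a sketch, assuming the key and directory from the steps above (add -r to copy the envs and scripts folders):\
$ scp -i <path_to_your_private_key_file> samples.csv <your_c3ddb_username>@c3ddb01.mit.edu:/scratch/mit_lieberman/YYYY_MM_DD_myName_myProject/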
5. Confirm that your reference genome(s) are already uploaded:
a. Navigate to /scratch/mit_lieberman/Reference_Genomes/<your_reference_genome> and confirm that it contains a file called genome.fasta
b. Here are some terminal commands that may come in handy:\
cd (change directory)\
ls (list directory contents)
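c. For example, this should list the file if it is there:\
$ ls /scratch/mit_lieberman/Reference_Genomes/<your_reference_genome>/genome.fasta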
6. Run the mapping step on c3ddb:
a. Navigate back to the mapping step directory you created (/scratch/mit_lieberman/YYYY_MM_DD_myName_myProject)
b. Use the sbatch command in the terminal to run myjob.slurm (which runs snakemakeslurm.sh with defined parameters). First check that myjob.slurm has your email address in it, so no one else is spammed with error messages.\
$ sbatch myjob.slurm
c. You can monitor the progress by viewing the output files or viewing the status of your jobs on the cluster:\
$ less masterout.txt (view output file)\
$ less mastererr.txt (view error file)\
$ sacct -u <your_username> (view status of your jobs on c3ddb)
d. If you need to, this is how you can cancel all of the jobs associated with your username (if you only cancel myjob.slurm, you will not cancel the jobs that myjob.slurm created):\
$ scancel -u <your_username>
7. Check that all the files were successfully created during the mapping step:
a. There should be as many folders in /scratch/mit_lieberman/YYYY_MM_DD_myName_myProject/ as you had samples
b. There should be 11 files in each sample folder .../mySample/sickle2050/<name_of_alignment>/: aligned.sorted.bam, aligned.sorted.bam.bai, alignment_info.mat, sample.dindel_output.libraries.txt, sample.dindel_output.variants.txt, strain, strain.pileup, strain.vcf, unaligned.1.fastq, unaligned.2.fastq, and variant.vcf.
c. Here are some examples of ways you could check to make sure the mapping step was successful.
i. If you want to, for example, count how many variant.vcf files there are across all of your samples, you can type something like this into the terminal (while in the main mapping directory) and it should return the number of samples you processed:\
$ ls */sickle2050/<name_of_alignment>/variant.vcf | grep -c variant.vcf
ii. If you want to check the file sizes of all of your strain.pileup files, you could type:\
$ ls -lSh */sickle2050/<name_of_alignment>/strain.pileup\
This should give you a list of all of the strain.pileup files and how big they are. If the list is really long, you can just list the smallest 100 files by typing:
$ ls -lSh */sickle2050/<name_of_alignment>/strain.pileup | tail -n100
**Resources and tools:**
- c3ddb: the computing cluster we use
a. Website: <http://www.mghpcc.org/resources/computer-systems-at-the-mghpcc/c3ddb/>
- Cyberduck: a tool for copying files to/from c3ddb and editing files on c3ddb
a. Download: <https://cyberduck.io/>
b. Setup: SFTP, <[email protected]>, select private key
- FileZilla: another tool similar to Cyberduck (choose whichever one you like better)
a. Download: https://filezilla-project.org/
- tview: a command line tool for viewing alignments
a. Load on c3ddb: $ module add c3ddb/samtools/1.2
b. Documentation: <https://www.htslib.org/doc/samtools-1.2.html>
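c. Example usage, assuming the mapping step outputs described above (run from a sample's alignment folder; the reference path is the one used in the mapping step):\
$ samtools tview aligned.sorted.bam /scratch/mit_lieberman/Reference_Genomes/<your_reference_genome>/genome.fasta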
- JSON file format: <https://www.json.org/json-en.html>
**Case Step**
**Summary: **This step gathers information across all samples at all candidate SNP positions. It takes the alignments of each sample to the reference genome [1] and outputs a table of candidate SNPs across all samples along with associated quality metrics. This process includes:
1. Summarizing output alignment files into easily readable matlab tables;
2. Making a list of candidate SNP positions; and
3. Collecting and storing information from each candidate SNP position across all samples.
SNP Filtering:
- Here, candidate SNPs occur at the set of positions where there is at least one fixed mutation across all samples. Variants from the alignment step are only considered if the FQ value is below -30.
- FQ is "consensus quality", or the minus phred-scaled probability of all aligned reads being identical at a given position. However, this value is independent of the base at that position on the reference genome.
Quality metrics: diversity.mat tracks 39 different quality metrics, the first 8 of which are included in the data structure called counts which is part of candidate_mutation_table.mat (see figure).
- [1-4] A, T, C, G: number of occurrences of each nucleotide at this position on an aligned forward read
- [5-8] a, t, c, g: number of occurrences of each nucleotide at this position on an aligned reverse read
- [9-16] Aq, ..., gq: the average phred qualities of all A's/.../g's
- [17-24] Am, ..., gm: the average mapping qualities of all A's/.../g's
- [25-32] At, ..., gt: the average tail distance of all A's/.../g's
- [33] Ps is the p value for strand bias (Fisher's test)
- [34] Pb is the p value for the base qualities being the same for the two different types of calls (1st major, 2nd major nt, either strand) (t-test)
- [35] Pm is the p value for the mapping qualities being the same for the two different types of calls (t-test)
- [36] Pftd is the p value for the tail distances on the forward strand being the same for the two different types of calls (t-test)
- [37] Prtd is the p value for the tail distances on the reverse strand being the same for the two different types of calls (t-test)
- [38] E is number of calls at ends of a read
- [39] D is number of reads supporting indels in the +/- (indelregion) bp region
**How to run this step on c3ddb:**
There are three main files you will need to run the case step:
1. sample_names.csv: a table listing all of your sample names (one per row), where to find their alignments to the reference genome, and whether or not they are outgroups:
a. 'ExperimentFolder': The directory that you used for the mapping step, ex. /scratch/mit_lieberman/YYYY_MM_DD_myName_myProject/
b. 'Sample': Names of your samples (you can copy/paste the 'Sample' column from samples.csv)
c. 'AlignmentFolder': The subdirectory (relative to the ExperimentFolder) where your alignment files are, ex. <sample_name>/sickle2050/Pacnes_C1/
d. 'Outgroup": 0 for a regular sample, 1 for an outgroup; outgroups will not contribute to SNP calling.
You will need to update this file (a hypothetical example is sketched after this list).
2. build_mutation_table_master.m: the matlab script for the case step; this script calls other matlab scripts and sends its own jobs to the cluster.
a. Make sure that the genome name is correct on the line that defines REF_GENOME_DIRECTORY,\
ex. REF_GENOME_DIRECTORY = '/scratch/mit_lieberman/Reference_Genomes/Pacnes_C1';
3. case.slurm: instructions for the slurm cluster to run build_mutation_table_master.m
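For reference, a hypothetical sample_names.csv might look like the sketch below (all values are invented placeholders following the column descriptions above):
ExperimentFolder,Sample,AlignmentFolder,Outgroup
/scratch/mit_lieberman/YYYY_MM_DD_myName_myProject/,mySample1,mySample1/sickle2050/Pacnes_C1/,0
/scratch/mit_lieberman/YYYY_MM_DD_myName_myProject/,myOutgroup1,myOutgroup1/sickle2050/Pacnes_C1/,1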
Now you're ready to run the case step on the cluster.
1. Create a folder in the Lieberman Lab folder on c3ddb to hold all the files used and created during the case step.
a. Example format:\
/scratch/mit_lieberman/case_YYYY_MM_DD_myName_myProjectCase
2. Upload the three files listed above to the directory you just created.
a. Again, Cyberduck or FileZilla may be helpful here.
3. Run the case step on c3ddb:
a. Make sure you are in the directory you just created (if not, navigate there in the terminal).
b. Use the sbatch command in the terminal to run case.slurm:\
$ sbatch case.slurm
4. Check to make sure that the case step completed successfully.
a. Look at caseout.txt and caseerr.txt. caseout.txt should end with "Done!" and caseerr.txt should not have any errors (java errors are OK, since these sometimes happen when matlab closes, but other errors may indicate a failed script)\
$ less caseout.txt (start looking at caseout.txt from the top) OR $ tail caseout.txt -n100 (look at the last 100 lines of caseout.txt)\
$ less caseerr.txt
b. Make sure that the following files were created, and download them:\
coveragematrix.mat\
candidate_mutation_table.mat
**Resources/References:**
- Pileup file format: <http://samtools.sourceforge.net/pileup.shtml>
- VCF file format: <https://samtools.github.io/hts-specs/VCFv4.2.pdf>
**Analysis Step**
**Summary: **This step takes a closer look at candidate SNPs, and can produce a variety of outputs, including a phylogenetic tree of the samples based on a filtered list of SNPs.
**How to run this step:**
These instructions are for running the analysis step on your own computer (although you could modify the code to run on c3ddb).
1. Set up a local analysis directory and put the following files inside:
a. candidate_mutation_table.mat (data generated in the previous step)
b. coveragematrix.mat (more data generated in the previous step)
c. analysis.m: matlab file for the analysis step
i. Make sure that REFGENOMEFOLDER and SCRIPTSDIRECTORY point to the Reference_Genomes folder and scripts folder respectively on the Lieberman Lab Dropbox.\
REFGENOMEFOLDER=['<path_to_lab_dropbox_folder>/Lieberman Lab/Reference_Genomes/' refgenome];\
SCRIPTSDIRECTORY = ['<path_to_lab_dropbox_folder>/Lieberman Lab/scripts'];
ii. Make sure that refgenome is the name of the reference genome you used for alignment, ex. refgenome='Pacnes_C1';
2. Use the interactive SNP table to examine positions, adjust filters, and rerun.
3. This matlab script runs only a fraction of the kinds of analyses that are facilitated by already-written scripts in the scripts folder. Rather than reinventing the wheel, talk to Tami about analyses such as:
a. dNdS
b. Parallel evolution
c. Other kinds of phylogeny building algorithms
d. File formats
**Resources:**
- dnapars: tree-making command line tool called in analysis.m
a. Documentation: <http://evolution.genetics.washington.edu/phylip/doc/dnapars.html>
- FigTree: an app for viewing phylogenetic trees
a. Website: <http://tree.bio.ed.ac.uk/software/figtree/>
- Github repository from Tami's TB paper, showing some usage cases of functions in the scripts folder:
a. <https://github.com/tamilieberman/TB-diversity-across-organs>
- ...
**Extras**
**What is a cluster?**
c3ddb is a slurm cluster. Slurm is a cluster management and job scheduling system. If you send slurm a job, slurm decides which node(s) will execute the job and when this will happen.
Cluster schematic: (image not reproduced here; image credit: https://www.hpc2n.umu.se/sites/default/files/Documentation/Guides/cluster.png)
References:
- Harvard slurm cluster: <https://wiki.rc.hms.harvard.edu/display/O2/Using+Slurm+Basic>
- Another academic cluster that uses slurm: <https://wiki.deac.wfu.edu/user/index.php/SLURM:Quick_Start_Guide>
- Slurm: <https://slurm.schedmd.com/>
**Basic bash commands:**
Note: c3ddb runs on linux, so when you log in through ssh, you will be using bash, a linux shell (even if you logged in from a PC; if you logged in from a Mac, the type of shell will be the same, since Macs also use bash).
For navigating:
- cd (change directory)
- ls (list directory contents)
For creating and deleting files or directories:
- mkdir (make directory)
- rm (remove)
- mv (move)
- cp (copy)
For searching:
- grep (print lines matching a pattern)
For viewing files:
- more <filename>
- less <filename>
- tail <filename> (for looking at the end of a file)
For making quick edits to files:
- nano <filename>
- vim <filename>
a. <https://www.linux.com/learn/vim-101-beginners-guide-vim>
(Otherwise, use Cyberduck or FileZilla to edit files in an application on your desktop!)
Other:
- man <name_of_command> (manual; for viewing help pages for bash commands)
- | (pipe)
- chmod (use to change file read/write/execute permissions)
- CTRL+C (stop current running shell commands)
- TAB (autocomplete)
- UP_ARROW (previous shell commands you've used)
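For example, several of these can be combined with a pipe (the search term here is just an illustration):
- grep "Error" mastererr.txt | less (search the master error log and page through any matches)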
Resources:
- *Bite Sized Command Line* by Julia Evans (on Lieberman Lab Dropbox)
- <http://tldp.org/LDP/Bash-Beginners-Guide/html/>
- <https://linuxconfig.org/bash-scripting-tutorial>
* * * * *
[1] *Note: If you did multiple alignments, then you will need to do this step multiple times.*
* * * * *
[VK1] Should indicate that the standard copies of these (like the Snakefile, envs, etc.) are mostly in /scratch/mit_lieberman/scripts/
[VK2] Technically, you also need a myjob.slurm file
[HQD3] The column headers here do not match the column headers of the samples.csv in /Projects/VariantPipelineDevelopment/mapping