Pipeline.txt
The pipeline includes the following steps:
1. Mapping of sea cow individual reads to the dugong genome
2. BAM filtering and statistics
3. Genotyping of sea cow reads
4. PSMC
5. Genome annotation
6. Orthology assessment
7. Codon-based alignment
8. codeml run
9. Gene Ontology annotation
10. Population diversity estimates
#Mapping of sea cow individual reads to the dugong genome
Here we used fastq2bam to read and merge the fastq reads into BAM format.
The -m option attempts to merge or trim read pairs.
bwa bam2bam takes the indexed reference genome file and maps with the parameter set -n (num-diff), -o (max-gap-open), -l (seed-length), -p (listen-port) and -t (num-threads).
fastq2bam -m -1 Read_R1.fastq -2 Read_R2.fastq | bwa bam2bam -g Reference.fasta -n 0.01 -o 2 -l 16000 -p 4711 -t 48 -f Aligned.bam -
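bwa bam2bam needs the reference indexed in advance; assuming this BWA build follows standard BWA indexing (an assumption, since bam2bam comes from a modified BWA), the index can be built once with:
bwa index Reference.fasta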
#BAM filtering and statistics
For removing duplicates and filtering on DNA fragment length and mapping quality, we used an in-house Perl program called analyzeBAM.pl, with -minlength set to 32 and -qual set to 30.
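A hypothetical invocation sketch (the exact interface of this in-house script is not documented here; input and output names are placeholders):
perl analyzeBAM.pl -minlength 32 -qual 30 Aligned.bam > Aligned.filtered.bam #hypothetical usage; check the script's actual interface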
#Genotyping of sea cow reads
Genotyping of the sea cow individuals was done with the snpAD tool, using only scaffolds of length ≥ 100. The script snpAD.sh is used for genotyping and takes scaffold names as input.
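Assuming snpAD.sh is run once per scaffold, a run over a list of scaffold names could look like this (scaffold_names.txt is a placeholder):
for scaf in `cat scaffold_names.txt`; do ./snpAD.sh $scaf; done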
#PSMC
PSMC analysis requires a consensus genome sequence (fastq) that can be filtered to account for coverage and sequencing error. Only autosomal scaffolds longer than 100 kb can be used for the analysis.
1. Fastq sequence for autosomal regions
samtools mpileup -C50 -uf Reference.fasta Input.bam | bcftools call -c - | vcfutils.pl vcf2fq -d 10 -D 100 | gzip > diploid.fq.gz
2. Fastq to PSMC
fq2psmcfa -q20 diploid.fq.gz > Input.psmcfa
3. Run PSMC
psmc -t40 -p "4+25*2+4+6" -o Input.psmc Input.psmcfa
4. Combine output (if you have more than one individual)
cat Input.psmc Input02.psmc > combined.psmc
5. Plot
psmc_plot.pl -u <mutation rate> -g <generation time> combined combined.psmc
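For example, with illustrative placeholder values (not the study's estimates) of a mutation rate of 1.0e-08 per site per generation and a generation time of 20 years:
psmc_plot.pl -u 1.0e-08 -g 20 combined combined.psmc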
6. Perform bootstrapping (100 rounds)
splitfa Input.psmcfa > split-Input.psmcfa
seq 100 | xargs -i echo psmc -t40 -b -p "4+25*2+4+6" -o Input{}.psmc split-Input.psmcfa | sh
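The 100 bootstrap outputs can then be concatenated with the original run for plotting, following the usual PSMC workflow (output name is a placeholder):
cat Input.psmc Input{1..100}.psmc > combined_boot.psmc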
#Genome Annotation
For performing the annotation, EST or protein homology evidence is needed. For this we used assembled mRNA and protein sequences from elephant, human and mouse.
1. First Round
MAKER runs with the control files (.ctl): maker_bopts, maker_exe and maker_opts.
In maker_opts.ctl you need to add the locations/information for the following fields:
genome, organism_type, est and protein (a minimal sketch of these entries is shown after the flags below). Also set the following flags to enable gene prediction solely from RNA and protein evidence:
est2genome=1 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=1 #infer predictions from protein homology, 1 = yes, 0 = no
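A minimal sketch of the corresponding maker_opts.ctl entries (file names are placeholders for your own assembly and evidence sets):
genome=Assembly.fasta #your genome assembly
organism_type=eukaryotic
est=est_evidence.fasta #e.g. assembled mRNA from elephant, human and mouse
protein=protein_evidence.fasta #homologous protein sequences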
After execution of MAKER, the results are stored in a directory named after your assembly file plus .maker.output.
Two accessory scripts that come with MAKER, gff3_merge and fasta_merge, are used to merge the GFF3 and FASTA files containing all genes.
fasta_merge -d xxx_datastore_index.log
gff3_merge -d xxx_datastore_index.log
1.a: Train the gene predictor (SNAP)
maker2zff creates two files, genome.ann and genome.dna, which contain information about the gene models as well as the actual DNA sequences.
maker2zff xxx_all.gff
Then run fathom as follows
fathom genome.ann genome.dna -categorize 1000
fathom uni.ann uni.dna -export 1000
forge export.ann export.dna
Finally, train SNAP with hmm-assembler:
hmm-assembler.pl mygenome . > mygenome.hmm
2. Run Second Round
Add snaphmm=mygenome.hmm; remove the file paths to the genome, protein and EST evidence, or set est2genome=0 and protein2genome=0; add xxx_all.gff (the first-round GFF) to maker_gff; and set all options under "Re-annotation Using MAKER Derived GFF3" to 1, as sketched below.
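A minimal sketch of the second-round maker_opts.ctl changes (assuming MAKER's standard re-annotation fields; xxx_all.gff is the first-round GFF):
maker_gff=xxx_all.gff
est_pass=1 #reuse EST evidence from the first-round GFF (likewise protein_pass, rm_pass, pred_pass, etc.)
protein_pass=1
snaphmm=mygenome.hmm
est2genome=0
protein2genome=0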
#Orthology Assessment
The orthology assessment begins with the BLAST output (tabular format). To extract the best hit for each query from the BLAST output, use the Blast_Best_Hit.py script.
Syntax: python Blast_Best_Hit.py > Best_blast_hit.txt
The next step is to find how many hits each query has. In our study we used human, mouse and elephant proteins and kept only queries with exactly three hits from the BLAST output.
sort Best_blast_hit.txt |awk '{print$1}'|sort | uniq -c | sed 's/^ *//'|grep "^3" |awk '{print $2}' > 3hits.txt
From the above list, find the IDs of each species by comparing against Best_blast_hit.txt:
grep -wFf 3hits.txt Best_blast_hit.txt |awk '{print$1,$2}' >Ortho_Ids.txt
For the next step we need to validate these orthologs against the Ensembl database; the ortholog IDs can be downloaded from Ensembl BioMart: https://www.ensembl.org/biomart.
Use the script Match_Gene_From_Ensembel.awk for this purpose. This script uses Ortho_Ids.txt and the ortholog table downloaded from Ensembl (ortho_ensemble_downloaded).
Syntax: awk -f Match_Gene_From_Ensembel.awk Ortho_Ids.txt ortho_ensemble_downloaded > Ensemble_Orthos.txt
Next, filter the above output for orthologs with a count of 3 (3 species):
sort Ensemble_Orthos.txt | awk '{print$1}' | uniq -c | sed 's/^ *//' | grep "^3" | awk '{print $2}' > Ens_Orthos_3_species.txt
Then grep the IDs:
grep -wFf Ens_Orthos_3_species.txt Ensemble_Orthos.txt > Orthologous.txt
All the above steps are run for both the dugong and the manatee sets; finally the two ortholog sets are merged with the Merge_orthologus.awk script, as sketched below.
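A hypothetical merge invocation, assuming the script takes the two per-species ortholog tables as input (file names are placeholders):
awk -f Merge_orthologus.awk Orthologous_dugong.txt Orthologous_manatee.txt > Merged_orthologous.txt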
#Codon-based Alignment
For aligning the multi-species orthologs we used the MACSE codon-based aligner with the following options:
java -jar macse_v2.03.jar -prog alignSequences -seq Input.fasta
This tool generates the alignment at both the NT and AA levels.
For removing the internal stop codons we used the exportAlignment subprogram of MACSE:
java -jar macse_v2.04.jar -prog exportAlignment -align Input.fas -codonForInternalStop NNN
#Codeml run
For running codeml we removed all internal stop codons from the alignment. The assessment of each gene was done with a Perl script called kaks_codeml_lineage.pl.
You need to provide the tree files (PHYLIP format) and the codeml control files to the CodeMLPairwise.pm and kaks_codeml_lineage.pl scripts.
Syntax:
for i in `ls input_folder`; do perl kaks_codeml_lineage.pl $i ./output/`echo $i | sed 's/.fasta/codeml_0.txt/g'` 2> ./output/`echo $i | sed 's/.fasta/codeml_0.txt.e/g'`; done
The output folder will contain the model and error files.
You need to run the script for model 0 and model 2 (sea cow as the foreground branch) and compare the models with a likelihood-ratio test. The LRT statistic 2*(lnL_model2 - lnL_model0) follows a chi-squared distribution with one degree of freedom under the null, and 3.84 is its 5% critical value, which is what the final awk filter checks:
awk 'BEGIN {while ( getline < ARGV[1] > 0 ){extradata[$1]=$0} ARGV[1]="";} {OFS="\t"; print extradata[$1], $0}' results_model0.txt results_model2.txt | awk 'BEGIN {OFS="\t"; print "Gene_ID", "seq_length", "lnL_model0", "omega_model0", "lnL_model2", "omega_background", "omega_COL", "LRT", "Trend"} {gsub("lnL:",""); gsub(".fasta",""); if ($34 > $33) {print $1, $2, $4, $16, $20, $33, $34, 2*($20-$4), "faster"} else {print $1, $2, $4, $16, $20, $33, $34, 2*($20-$4), "slower"}}' | awk '$(NF-1) >3.84' > genes_sign_different.txt
#Gene Ontology (GO) annotation
For gene ontology analysis we used FUNC; for this, the Ensembl gene IDs of the genes to test are needed. Using the script mapping_common.py, 4877 genes with orthologs were mapped to GO names (Gene Ontology (GO) annotation from Ensembl v98).
The input data sets are divided into two:
1. Stop genes
for i in `awk '{print $1}' stops_gene_ID.txt`; do grep -w $i go_mapped.txt | awk 'BEGIN {OFS=FS="\t"}; {print $2,$7,"1"}'| awk 'NF==3'; done > forward_genes.txt
2. Background genes
for i in `awk '{print $1}' stops_gene_ID.txt`; do grep -w -v $i go_mapped.txt | awk 'BEGIN {OFS=FS="\t"}; {print $2,$7,"0"}'| awk 'NF==3'; done > background_genes.txt
Prepare Input for FUNC
cat forward_genes.txt background_genes.txt > input_stop_codon_func.txt
Run FUNC
func_hyper -i input_stop_codon_func.txt -t go_monthly-termdb-tables -o output-directory
Find significant groups
To find the significant groups, use the script get.sig.go_Func.sh. It takes groups.txt from the FUNC output folder and generates significant_groups.txt.
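A hypothetical call, assuming the script takes the groups.txt path as its argument and writes significant_groups.txt itself:
bash get.sig.go_Func.sh output-directory/groups.txt #hypothetical usage; check the script's actual interface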
Next, collect the GO categories to identify the genes:
input_to_go_genes.pl go_monthly-termdb-tables/term.txt go_monthly-termdb-tables/graph_path.txt input_stop_codon_func.txt > GO_cats_for_input_stop_codon_func.txt
Get the genes for the significant GO terms
for i in `cat significant_groups.txt | cut -f 3 | grep GO`; do echo $i; grep $i GO_cats_for_input_stop_codon_func.txt | cut -f 1 > GO.txt; for j in `cat GO.txt`; do grep $j input_stop_codon_func.txt |awk '$3==1' | cut -f 1 | sort | uniq | tr "\n" ", "; grep $j stops_gene_ID.txt | cut -f 2 | tr "\n" "; "; done; rm GO.txt; done | sed 's/;GO:/\nGO:/g' > GO_genes.txt
Arrange it in two columns:
xargs -n2 < GO_genes.txt
#Population diversity estimates
We used Consensify to generate a consensus pseudohaploid genome sequence.
Consensify takes 3 files as input: counts, pos and lengths. The .counts and .pos files can be generated from a standard BAM file using the -doCounts function in ANGSD:
angsd -doCounts 1 -dumpCounts 3 -rf scaffold_lengths.txt -i Input.bam -out out_name
Unzip the count/pos files
gunzip -k out_name.counts.gz
gunzip -k out_name.pos.gz
Make a scaffold length file containing a 2-column, tab-separated table: scaffold name, length.
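If the reference has been indexed with samtools faidx, the first two columns of the .fai index are exactly scaffold name and length, so one way to build this table is:
samtools faidx Reference.fasta
cut -f 1,2 Reference.fasta.fai > scaffold_lengths.txt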
Run Consensify
Here the maximum read depth filter can be set to 30 (i.e. only positions covered by 30 reads or fewer are considered); you can change this depending on the coverage of your alignment.
perl Consensify-master/Consensify_v0.1.pl out_name.pos out_name.counts scaffold_lengths.txt out_coverage30.fasta 30
The Consensify output was split into individual scaffolds using the following command:
faSplit byname scaffolds.fa outfolder/
Pairwise comparisons of individuals dated to the same time period
For the pairwise comparison, the scaffolds of the two individuals were concatenated like this:
cat individual01/scaffold100.fa individual02/scaffold100.fa > For_Consensify/scaffold100.fa
Then count the number of differences within non-overlapping blocks of 50 kb using the Pairwise_comparison.py script as follows:
FILES=For_Consensify/*
for f in $FILES
do
    python Pairwise_comparison.py $f > "$f"_.count
done