Pipeline.txt
The pipeline includes the following steps:
1. Mapping of sea cow individual reads to the dugong genome
2. BAM filtering and statistics
3. Genotyping of sea cow reads
4. PSMC
5. Genome annotation
6. Orthology assessment
7. Codon-based alignment
8. codeml run
9. Gene Ontology annotation
10. Population diversity estimates
#Mapping of sea cow individual reads to the dugong genome
Here we used fastq2bam to read and merge the fastq reads into BAM format.
The -m option attempts to merge or trim read pairs.
bwa bam2bam takes the indexed reference genome file and maps with the parameter set -n (num-diff), -o (max-gap-open), -l (seed-length), -p (listen-port) and -t (num-threads).
fastq2bam -m -1 Read_R1.fastq -2 Read_R2.fastq | bwa bam2bam -g Reference.fasta -n 0.01 -o 2 -l 16000 -p 4711 -t 48 -f Aligned.bam -
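bwa bam2bam needs the reference indexed in advance; assuming this BWA build follows standard BWA indexing (an assumption, since bam2bam comes from a modified BWA), the index can be built once with:
bwa index Reference.fasta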
#BAM filtering and statistics
For removing duplicates and filtering on DNA fragment length and mapping quality, we used an in-house Perl program called analyzeBAM.pl, with -minlength set to 32 and -qual set to 30.
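A hypothetical invocation sketch (the exact interface of this in-house script is not documented here; input and output names are placeholders):
perl analyzeBAM.pl -minlength 32 -qual 30 Aligned.bam > Aligned.filtered.bam #hypothetical usage; check the script's actual interface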
#Genotyping of sea cow reads
Genotyping of the sea cow individuals was done with the snpAD tool, using only scaffolds of length ≥ 100. The script snpAD.sh is used for genotyping and takes scaffold names as input.
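Assuming snpAD.sh is run once per scaffold, a run over a list of scaffold names could look like this (scaffold_names.txt is a placeholder):
for scaf in `cat scaffold_names.txt`; do ./snpAD.sh $scaf; done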
#PSMC
PSMC analysis requires a consensus genome sequence (fastq) that can be filtered to account for coverage and sequencing error. Only autosomal scaffolds longer than 100 kb can be used for the analysis.
1. Fastq sequence for autosomal regions
samtools mpileup -C50 -uf Reference.fasta Input.bam | bcftools call -c - | vcfutils.pl vcf2fq -d 10 -D 100 | gzip > diploid.fq.gz
2. Fastq to PSMC
fq2psmcfa -q20 diploid.fq.gz > Input.psmcfa
3. Run PSMC
psmc -t40 -p "4+25*2+4+6" -o Input.psmc Input.psmcfa
4. Combine output (if you have more than one individual)
cat Input.psmc Input02.psmc > combined.psmc
5. Plot
psmc_plot.pl -u <mutation rate> -g <generation time> combined combined.psmc
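For example, with illustrative placeholder values (not the study's estimates) of a mutation rate of 1.0e-08 per site per generation and a generation time of 20 years:
psmc_plot.pl -u 1.0e-08 -g 20 combined combined.psmc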
6. Perform bootstrapping (100 rounds)
splitfa Input.psmcfa > split-Input.psmcfa
seq 100 | xargs -i echo psmc -t40 -b -p "4+25*2+4+6" -o Input{}.psmc split-Input.psmcfa | sh
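The 100 bootstrap outputs can then be concatenated with the original run for plotting, following the usual PSMC workflow (output name is a placeholder):
cat Input.psmc Input{1..100}.psmc > combined_boot.psmc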
#Genome Annotation
For performing the annotation, EST or protein homology evidence is needed. For this we used assembled mRNA and protein sequences from elephant, human and mouse.
1. First Round
MAKER runs with the control files (.ctl): maker_bopts, maker_exe and maker_opts.
In maker_opts.ctl you need to add the locations/information for the following fields:
genome, organism_type, est and protein (a minimal sketch of these entries is shown after the flags below). Also set the following flags to enable gene prediction solely from RNA and protein evidence:
est2genome=1 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=1 #infer predictions from protein homology, 1 = yes, 0 = no
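A minimal sketch of the corresponding maker_opts.ctl entries (file names are placeholders for your own assembly and evidence sets):
genome=Assembly.fasta #your genome assembly
organism_type=eukaryotic
est=est_evidence.fasta #e.g. assembled mRNA from elephant, human and mouse
protein=protein_evidence.fasta #homologous protein sequences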
After execution of MAKER, the results are stored in a directory named after your assembly file plus .maker.output.
Two accessory scripts that come with MAKER, gff3_merge and fasta_merge, are used to merge the GFF3 and FASTA files containing all genes.
fasta_merge -d xxx_datastore_index.log
gff3_merge -d xxx_datastore_index.log
1.a: Train the gene predictor (SNAP)
maker2zff creates two files, genome.ann and genome.dna, which contain information about the gene models as well as the actual DNA sequences.
maker2zff xxx_all.gff
Then run fathom as follows
fathom genome.ann genome.dna -categorize 1000
fathom uni.ann uni.dna -export 1000
forge export.ann export.dna
Finally, train SNAP with hmm-assembler:
hmm-assembler.pl mygenome . > mygenome.hmm
2. Run Second Round
Add snaphmm=mygenome.hmm; remove the file paths to the genome, protein and EST evidence, or set est2genome=0 and protein2genome=0; add xxx_all.gff (the first-round GFF) to maker_gff; and set all options under "Re-annotation Using MAKER Derived GFF3" to 1, as sketched below.
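A minimal sketch of the second-round maker_opts.ctl changes (assuming MAKER's standard re-annotation fields; xxx_all.gff is the first-round GFF):
maker_gff=xxx_all.gff
est_pass=1 #reuse EST evidence from the first-round GFF (likewise protein_pass, rm_pass, pred_pass, etc.)
protein_pass=1
snaphmm=mygenome.hmm
est2genome=0
protein2genome=0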
#Orthology Assessment
The orthology assessment begins with the BLAST output (tabular format). To extract the best hit for each query from the BLAST output, use the Blast_Best_Hit.py script.
Syntax: python Blast_Best_Hit.py > Best_blast_hit.txt
The next step is to find how many hits each query has. In our study we used human, mouse and elephant proteins and kept only queries with exactly three hits from the BLAST output.
sort Best_blast_hit.txt |awk '{print$1}'|sort | uniq -c | sed 's/^ *//'|grep "^3" |awk '{print $2}' > 3hits.txt
From the above list, find the IDs of each species by comparing against Best_blast_hit.txt:
grep -wFf 3hits.txt Best_blast_hit.txt |awk '{print$1,$2}' >Ortho_Ids.txt
For the next step we need to validate these orthologs against the Ensembl database; the ortholog IDs can be downloaded from Ensembl BioMart: https://www.ensembl.org/biomart.
Use the script Match_Gene_From_Ensembel.awk for this purpose. This script uses Ortho_Ids.txt and the ortholog table downloaded from Ensembl (ortho_ensemble_downloaded).
Syntax: awk -f Match_Gene_From_Ensembel.awk Ortho_Ids.txt ortho_ensemble_downloaded > Ensemble_Orthos.txt
Next, filter the above output for orthologs with a count of 3 (3 species):
sort Ensemble_Orthos.txt | awk '{print$1}' | uniq -c | sed 's/^ *//' | grep "^3" | awk '{print $2}' > Ens_Orthos_3_species.txt
Then grep the IDs:
grep -wFf Ens_Orthos_3_species.txt Ensemble_Orthos.txt > Orthologous.txt
All the above steps are run for both the dugong and the manatee sets; finally the two ortholog sets are merged with the Merge_orthologus.awk script, as sketched below.
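A hypothetical merge invocation, assuming the script takes the two per-species ortholog tables as input (file names are placeholders):
awk -f Merge_orthologus.awk Orthologous_dugong.txt Orthologous_manatee.txt > Merged_orthologous.txt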
#Codon-based Alignment
For aligning the multi-species orthologs we used the MACSE codon-based aligner with the following options:
java -jar macse_v2.03.jar -prog alignSequences -seq Input.fasta
This tool generates the alignment at both the NT and AA levels.
For removing the internal stop codons we used the exportAlignment subprogram of MACSE:
java -jar macse_v2.04.jar -prog exportAlignment -align Input.fas -codonForInternalStop NNN
#Codeml run
For running codeml we removed all internal stop codons from the alignment. The assessment of each gene was done with a Perl script called kaks_codeml_lineage.pl.
You need to provide the tree files (PHYLIP format) and the codeml control files to the CodeMLPairwise.pm and kaks_codeml_lineage.pl scripts.
Syntax:
for i in `ls input_folder`; do perl kaks_codeml_lineage.pl $i ./output/`echo $i | sed 's/.fasta/codeml_0.txt/g'` 2> ./output/`echo $i | sed 's/.fasta/codeml_0.txt.e/g'`; done
The output folder will contain the model and error files.
You need to run the script for model 0 and model 2 (sea cow as the foreground branch) and compare the models with a likelihood-ratio test. The LRT statistic 2*(lnL_model2 - lnL_model0) follows a chi-squared distribution with one degree of freedom under the null, and 3.84 is its 5% critical value, which is what the final awk filter checks:
awk 'BEGIN {while ( getline < ARGV[1] > 0 ){extradata[$1]=$0} ARGV[1]="";} {OFS="\t"; print extradata[$1], $0}' results_model0.txt results_model2.txt | awk 'BEGIN {OFS="\t"; print "Gene_ID", "seq_length", "lnL_model0", "omega_model0", "lnL_model2", "omega_background", "omega_COL", "LRT", "Trend"} {gsub("lnL:",""); gsub(".fasta",""); if ($34 > $33) {print $1, $2, $4, $16, $20, $33, $34, 2*($20-$4), "faster"} else {print $1, $2, $4, $16, $20, $33, $34, 2*($20-$4), "slower"}}' | awk '$(NF-1) >3.84' > genes_sign_different.txt
#Gene Ontology (GO) annotation
For gene ontology analysis we used FUNC; for this, the Ensembl gene IDs of the genes to test are needed. Using the script mapping_common.py, 4877 genes with orthologs were mapped to GO names (Gene Ontology (GO) annotation from Ensembl v98).
The input data sets are divided into two:
1. Stop genes
for i in `awk '{print $1}' stops_gene_ID.txt`; do grep -w $i go_mapped.txt | awk 'BEGIN {OFS=FS="\t"}; {print $2,$7,"1"}'| awk 'NF==3'; done > forward_genes.txt
2. Background genes
for i in `awk '{print $1}' stops_gene_ID.txt`; do grep -w -v $i go_mapped.txt | awk 'BEGIN {OFS=FS="\t"}; {print $2,$7,"0"}'| awk 'NF==3'; done > background_genes.txt
Prepare Input for FUNC
cat forward_genes.txt background_genes.txt > input_stop_codon_func.txt
Run FUNC
func_hyper -i input_stop_codon_func.txt -t go_monthly-termdb-tables -o output-directory
Find significant groups
To find the significant groups, use the script get.sig.go_Func.sh. It takes groups.txt from the FUNC output folder and generates significant_groups.txt.
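A hypothetical call, assuming the script takes the groups.txt path as its argument and writes significant_groups.txt itself:
bash get.sig.go_Func.sh output-directory/groups.txt #hypothetical usage; check the script's actual interface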
Next, collect the GO categories to identify the genes:
input_to_go_genes.pl go_monthly-termdb-tables/term.txt go_monthly-termdb-tables/graph_path.txt input_stop_codon_func.txt > GO_cats_for_input_stop_codon_func.txt
Get the genes for the significant GO terms
for i in `cat significant_groups.txt | cut -f 3 | grep GO`; do echo $i; grep $i GO_cats_for_input_stop_codon_func.txt | cut -f 1 > GO.txt; for j in `cat GO.txt`; do grep $j input_stop_codon_func.txt |awk '$3==1' | cut -f 1 | sort | uniq | tr "\n" ", "; grep $j stops_gene_ID.txt | cut -f 2 | tr "\n" "; "; done; rm GO.txt; done | sed 's/;GO:/\nGO:/g' > GO_genes.txt
Arrange it in two columns:
xargs -n2 < GO_genes.txt
#Population diversity estimates
We used Consensify to generate a consensus pseudohaploid genome sequence.
Consensify takes 3 files as input: counts, pos and lengths. The .counts and .pos files can be generated from a standard BAM file using the -doCounts function in ANGSD:
angsd -doCounts 1 -dumpCounts 3 -rf scaffold_lengths.txt -i Input.bam -out out_name
Unzip the count/pos files
gunzip -k out_name.counts.gz
gunzip -k out_name.pos.gz
Make a scaffold length file containing a 2-column, tab-separated table: scaffold name, length.
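If the reference has been indexed with samtools faidx, the first two columns of the .fai index are exactly scaffold name and length, so one way to build this table is:
samtools faidx Reference.fasta
cut -f 1,2 Reference.fasta.fai > scaffold_lengths.txt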
Run Consensify
Here the maximum read depth filter can be set to 30 (i.e. only positions covered by 30 reads or fewer are considered); you can change this depending on the coverage of your alignment.
perl Consensify-master/Consensify_v0.1.pl out_name.pos out_name.counts scaffold_lengths.txt out_coverage30.fasta 30
The Consensify output was split into individual scaffolds using the following command:
faSplit byname scaffolds.fa outfolder/
Pairwise comparisons of individuals dated to the same time period
For the pairwise comparison, the scaffolds of the two individuals were concatenated like this:
cat individual01/scaffold100.fa individual02/scaffold100.fa > For_Consensify/scaffold100.fa
Then count the number of differences within non-overlapping blocks of 50 kb using the Pairwise_comparison.py script as follows:
FILES=For_Consensify/*
for f in $FILES
do
    python Pairwise_comparison.py $f > "$f"_.count
done