Commit 8e2e062

Fleur Gawehns-Bruning authored and committed

    move all bash files to src

1 parent f57086a · 27 files changed: +9718 −40 lines

README.md: +293 −17
Removed (the old HFC-permutation README):

```diff
-## HFC-permutation pipeline
-#### Authors: Fleur Gawehns, Veronika Laine
-### How to run the pipeline
-# replace <user-name> by your own NIOO login-name, e.g. FleurG
-git clone https://<user-name>@gitlab.bioinf.nioo.knaw.nl/pipelines/HFC-permutation.git
-# create a conda environment for plink2
-conda create -n plink2
-source activate plink2
-conda install -c bioconda plink2=1.90b3.35
-conda install -c bioconda parallel
-cd HFC-permutation
-# put your ped and map file in the data directory
-# execute the pipeline
-./script-gnu.sh
-# In the end the script concatenates the output of each sampling together. You can find the output in 'results'
```

Added (the new README):

# Pipeline is not tested for production yet!

Contents
========

* [How to use this file](#how-to-use-this-file)
* [Introduction](#introduction)
* [Login on the server](#login-on-the-server)
* [Copying the pipeline](#copying-the-pipeline)
* [Directories](#directories)
* [Prerequisites](#prerequisites)
* [Start the pipeline](#start-the-pipeline)
* [More Reading](#more-reading)

How to use this file
---------------------

This README gives a global introduction to working on the server, cloning the pipeline repository, and running the pipeline. We have tried to be as complete and precise as possible.
If you have any question, comment, complaint, or suggestion, or if you encounter any conflicts or errors in this document or the pipeline, please contact your Bioinformatics Unit ([email protected]) or open an `Issue`!

###### Enjoy your analysis and happy results!

```
Text written in boxes is code, which can usually be executed in your Linux terminal. You can simply copy/paste it.
Sometimes it is "special" code for R or another language. If this is the case, it is explicitly mentioned in the instructions.
```

`Text in a red box indicates directory or file names`

Text in brackets "<>" indicates that you have to replace it, including the brackets, with your own appropriate text.
Introduction
------------

With this pipeline you can analyse Illumina single-end reads from transcriptomics data of microbes. These instructions are set up for a metagenomics run but can easily be modified to run a single microbe, too.

* FastQC quality check
* Read mapping using Bowtie2 (more options available?)
* Transcript quantification using [RSEM](http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-323)
* Differential Expression Analysis with [TrinityEmpirical](https://github.com/trinityrnaseq/trinityrnaseq/wiki/Trinity-Differential-Expression). You can choose between different methods:
  * Analysis of Digital Gene Expression Data in R, [EdgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html)
  * [DESeq2](http://bioconductor.org/packages/release/bioc/html/DESeq2.html)
  * [limma/voom](http://bioconductor.org/packages/release/bioc/html/limma.html)
  * [ROTS](http://www.btk.fi/research/research-groups/elo/software/rots/)

The references used in this pipeline rely on the output of the [prokka annotation pipeline](https://gitlab.bioinf.nioo.knaw.nl/pipelines/prokka) and optionally on the [cog annotation pipeline](https://gitlab.bioinf.nioo.knaw.nl/pipelines/cog-assign.git). You have to run these pipelines before you can start your transcriptome analysis. However, this document will explain how to concatenate and transform the prokka and cog output to meet the requirements of this pipeline.
Login on the server
------------------

##### Your UserID:

userID = your NIOO ID (e.g. fleurg) and password.

##### Make the connection:

For Mac, type in the Terminal (the server name below is a placeholder; replace it with the NIOO server you work on):

```
ssh <userID>@<servername>
```

For Windows PC, login via PuTTY or MobaXterm.

After login you are located in your home folder:

`/home/NIOO/<userID>`

##### Enter your project directory

```
cd <your>/<project>/<directory>
```
Copying the pipeline
------------------

To start a new analysis project based on this pipeline, follow these steps:

- Clone the pipeline skeleton from our GitLab server by typing in the terminal:

```
git clone git@gitlab.bioinf.nioo.knaw.nl:pipelines/transcriptomics-microbes.git
```

- Enter `transcriptomics-microbes`

```
cd transcriptomics-microbes
```
Directories
--------------------------

##### The toplevel `README` file

This file contains general information about how to run this pipeline.

##### The `data` directory

Used to place `samples_contrasts.txt` and `samples_described.txt`.
Contains the subdirectories `ref` and `reads`.

##### The `data/ref` directory

Should contain the concatenated *.ffn files from each single prokka annotation. Even if a species consists of multiple chromosomes, concatenate them all. Ensure the IDs are consistent with the locus tags.

##### The `data/reads` directory

This directory contains a link to your raw data (#test if a link will work)

##### The `analysis` directory

This directory will contain all the results from this pipeline.

##### The `src` directory

Custom scripts are stored here.
Prerequisites
------------------

Snakemake runs under its own virtual environment. If you do not have a snakemake virtual environment, create one:

```
# source /data/tools/miniconda/4.2.12/env.sh; on nioo0002 miniconda is installed globally, so you do not have to source it there.
conda create -n snake
source activate snake
# conda install snakemake; snakemake is also already installed globally on nioo0002, so you do not have to install it again.
conda install biopython
```

If you already have the environment snake, then activate it with

```
source /data/tools/miniconda/4.2.12/env.sh
source activate snake
```

You can deactivate the environment again with

```
source deactivate
```

The transcriptomics pipeline requires a number of installed tools:

* bowtie2
* bowtie
* bwa
* kallisto
* fastqc
* ea-utils
* trinity

Fortunately, those packages are already installed, but you have to source them:

```
source env.sh
```
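After sourcing, the availability of these tools can be checked up front. A minimal sketch; the helper name `check_tools` is ours, not part of the pipeline:

```shell
# Hypothetical helper: report the first required tool that is not on PATH.
check_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || { echo "missing: $tool"; return 1; }
  done
  echo "all tools found"
}

# After `source env.sh`, run it against the pipeline's requirements:
# check_tools bowtie2 bowtie bwa kallisto fastqc trinity
```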
Start the pipeline
------------------

1) Activate the snake environment (see [Prerequisites](#prerequisites) for detailed instructions) and source the env.sh file if you have not done so yet.

```
source /data/tools/miniconda/4.2.12/env.sh
source activate snake
source env.sh
```
2) Prepare the reference from the prokka output. Concatenate all relevant *.ffn files using

```
cat <path>/<to>/prokka/analysis/<file1>.ffn <path>/<to>/prokka/analysis/<file2>.ffn <path>/<to>/prokka/analysis/<...>.ffn > data/ref/reference.fasta
```

Ensure the IDs are consistent with the locus tags, i.e.:

```
>AD56_00005 Transcriptional repressor NrdR
ATGCATTGCCCTTTCTGCCAGCACGAAGACACCCGCGTGATCGACTCACGCCTGACCGAG
```
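A quick sanity check after concatenating helps catch mismatched IDs early. A sketch with two made-up single-sequence files (file names and sequences are illustrative only):

```shell
# Build a tiny stand-in reference from two hypothetical prokka .ffn files.
mkdir -p data/ref
printf '>AD56_00005 Transcriptional repressor NrdR\nATGCATTGC\n' > file1.ffn
printf '>AD56_00010 hypothetical protein\nGGCTTACCA\n' > file2.ffn
cat file1.ffn file2.ffn > data/ref/reference.fasta

# Count the sequences and list their locus tags to verify the IDs.
grep -c '^>' data/ref/reference.fasta
grep '^>' data/ref/reference.fasta | cut -d ' ' -f1
```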
3) Place a link to your raw read files, for example:

```
ln -s <path>/<to>/<your>/<raw>/<reads> ./data/reads
```
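What the link ends up looking like can be sketched with a throw-away directory; all paths here are placeholders, not the real raw-data location:

```shell
# Hypothetical raw-data directory standing in for your sequencing output.
mkdir -p /tmp/raw_demo data
printf 'placeholder\n' > /tmp/raw_demo/sample_S1_L001_R1_001.fastq.gz

# -n replaces an existing link instead of nesting a new link inside its target.
ln -sfn /tmp/raw_demo ./data/reads
ls -L ./data/reads
```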
4) Edit the config.json file to choose your aligner and define your `data` files.

```
nano ./doc/config.json.template
```

Example config:

```
{
    "threads" : 32,
    "aligners" : "bowtie2",
    "reference" : "bAD24_gpAD87.cleaned.fasta",
    "kallisto" : "/data/tools/kallisto/default/bin/kallisto",
    "samples" : ["1",
        "2",
        "3",
        "4",
        "5",
        "6",
        "7",
        "8",
        "9"
    ],
    "data": {
        "T4-2-1": { "forward" : ["./data/reads/LN344/I16-1385-01-t4_2-1_S1_L001_R1_001.fastq.gz"]},
        "T4-2-2": { "forward" : ["./data/reads/LN344/I16-1385-01-t4_2-1_S1_L001_R1_001.fastq.gz"]}
    }
}
```

Close nano using ctrl+x. When saving, remove the current name (doc/config.json.template) and replace it by typing `config.json`. The config file is now saved in your main directory.
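Before starting the run, it is worth checking that the saved config is syntactically valid JSON. A minimal sketch, assuming a python3 interpreter is available (the config below is a trimmed stand-in, not the full template):

```shell
# Write a trimmed stand-in config and validate it with python's JSON parser.
cat > config.json <<'EOF'
{
    "threads": 32,
    "aligners": "bowtie2",
    "samples": ["1", "2", "3"]
}
EOF
python3 -m json.tool config.json > /dev/null && echo "config.json is valid JSON"
```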
5) Make sure you are in the `transcriptomics-microbes` folder and run the RNAseq pipeline with:

```
snakemake -n          # dry run: shows the planned jobs without executing them
snakemake -j 6        # -j sets the number of cores (24 cores max)
```
6) Concatenate the expression output with the prokka and cog annotations:

First convert the prokka .gbk output for every single reference file that you used in the RNAseq pipeline to a tabular format with

```
./src/VDJ_prokka_gbk_to_txt.py -g <path>/<to>/prokka/analysis/<file1>.gbk -o ./analysis/<file1>.tsv
```

and concatenate the output using

```
cat analysis/<file1>.tsv analysis/<file2>.tsv ... > ./data/ref/reference.tsv
```

The file has to be modified to meet the requirements of the R script later on:

```
cat ./data/ref/reference.tsv | cut -f1 -d " " > ./data/ref/id.txt
paste ./data/ref/id.txt ./data/ref/reference.tsv > ./data/ref/reference-id.tsv
```
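The cut/paste step above strips everything after the first space so the bare locus tag becomes its own leading column. A sketch on a one-line stand-in for reference.tsv (the fields are made up):

```shell
mkdir -p data/ref
# One hypothetical annotation row: "locus-tag extra-text<TAB>start<TAB>stop<TAB>description"
printf 'AD56_00005 nrdR\t1\t300\tTranscriptional repressor NrdR\n' > data/ref/reference.tsv

cut -f1 -d " " data/ref/reference.tsv > data/ref/id.txt          # bare locus tag
paste data/ref/id.txt data/ref/reference.tsv > data/ref/reference-id.tsv
cat data/ref/reference-id.tsv
```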
Optional: Run the [cog-annotation pipeline](https://gitlab.bioinf.nioo.knaw.nl/pipelines/cog-assign.git) if you have not done so yet. Concatenate the output of multiple references into one single file:

```
cat <path>/<to>/cog-assign/analysis/<file1>.faa.tab <path>/<to>/cog-assign/<file2>.faa.tab ... > ./data/ref/reference.faa.tab
```
Run the following R code in R or RStudio to merge the three files (reference-id.tsv, reference.faa.tab, diffExpr.P<yourPvalue>_C0.585.matrix):

Read the annotation file generated with the python script

```R
annotation <- read.delim("./data/ref/reference-id.tsv", header=FALSE)
colnames(annotation) <- c("contig", "id", "start", "stop", "strand", "description")

# Optionally attach also the COG annotation:

cog <- read.delim2("./data/ref/reference.faa.tab", header=FALSE, sep="\t", flush=TRUE)
colnames(cog) <- c("id", "description", "cog", "cogdescription", "class", "classdescription")

# This generates the combined annotation object:

annotation <- merge(annotation, cog, by="id", all.x=TRUE)

# Read the EdgeR expression data as generated by the workflow and add a column "id":

diffExpr.P<yourPvalue>_C0.585 <- read.delim("./analysis/rsem_matrix/edgeR/diffExpr.P<yourPvalue>_C0.585.matrix")
colnames(diffExpr.P<yourPvalue>_C0.585)[1] <- c("id")

# Merge with the annotation and write a file to the EdgeR matrix folder:

diffExpr.P<yourPvalue>_C0.585 <- merge(diffExpr.P<yourPvalue>_C0.585, annotation, by="id", all.x=TRUE)
write.table(diffExpr.P<yourPvalue>_C0.585, file = "./analysis/rsem_matrix/edgeR/diffExpr.P<yourPvalue>_C0.585.annotated.matrix", sep="\t", col.names=TRUE)
```
More Reading
------------------

[Kallisto](https://pachterlab.github.io/kallisto/) and [sleuth](http://pachterlab.github.io/sleuth)

[https://www.biostars.org/p/143458/#157303](https://www.biostars.org/p/143458/#157303)

About the output, see [Trinity](https://github.com/trinityrnaseq/trinityrnaseq/wiki/Trinity-Differential-Expression)