-
Notifications
You must be signed in to change notification settings - Fork 0
/
Tutorial.Rmd
311 lines (228 loc) · 7.84 KB
/
Tutorial.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
---
title: "GCAT Base Tutorial"
author: Markus Gumbel
date: "`r format(Sys.time(), '%d. %B %Y')`"
params:
devel: TRUE
output:
html_document:
toc: true
number_sections: true
---
_under construction_
```{r include=params$devel, eval=params$devel}
devtools::load_all(".")
```
```{r include=!params$devel, eval=!params$devel}
library(gcatbase)
```
# Sequence, tuple and codon representations
Note: a tuple or a codon is considered to be a special case of a nucleotide sequence.
## Creation of tuples from an alphabet (`all_tuples`)
The function `all_tuples` creates all tuples. It expects the tuple size `tsize` and an alphabet $\Sigma$.
(`rextendr` does not have parameter names, does it?)
```{r}
sigma = c("+", "-") # Alphabet of two letters
tuples = all_tuples(tsize = 2, alphabet = sigma)
print(tuples)
```
Codons can be created with the `all_tuples` this way. By default, the alphabet are DNA bases.
```{r}
codons = all_tuples(3)
print(codons)
```
## The alphabet of tuples or a sequence (`alphabet`)
`alphabet` determines for a sequence or a vector of sequences the underlying alphabet including its type.
The type `type` can be "RNA" or "DNA" or "unkown".
```{r}
alphabet1 = alphabet("AUGC") # one sequence
print(alphabet1$letters)
print(alphabet1$type)
print(alphabet1)
```
```{r}
alphabet2 = alphabet(c("ATG", "CTT", "GAG", "CCG")) # vector
print(alphabet2)
```
```{r}
sample.code <- gcatbase::code(c("ATG", "CTT", "GAG", "CCG"), id="sample.code")
alphabet2 = alphabet(sample.code) # vector
print(alphabet2)
```
## Normalization of nucleotide sequences (`normalize`)
There are many ways to represent nucleotide sequences and codons or tuples in general:
* DNA or RNA?
* letters in lower- or upper-case?
* Sequences as string or as vector of characters?
`gcatbase` offers functions to convert the representation. The default representation is DNA in upper-case.
`normalize` converts a sequence or a vector of sequences (e.g. codons) to a specified representation.
```{r}
rna_codons = normalize(codons, RNA = TRUE, lowercase = TRUE)
print(rna_codons)
```
Of course, we can only normalize a sequence as well:
```{r}
seq = "ATCATGCCCACGGCGCGGAGGATAGCGAGCAGGT"
seqrna = normalize(seq, RNA = TRUE) # Uppercase is default
print(seqrna)
```
# Codes
## Create codes (`code`)
Codes are created with the `code` function. If the tuples passed to `code` are not unique they are made unique.
```{r}
# Non unique tuple set:
X = code(tuples = c("ATG", "GCG", "ATG"), id = "A code")
```
This will return an object of type `gcat.code`. The attributes are:
* `id`: a short identifier for this code.
* `tuples`: unique vector of tuples.
* `tsize`: tuple size (e.g. 3 if codons)
Note that the code is now unique. The elements are sorted.
```{r}
print(X)
```
In particular for codons the `summary` function is available.
```{r}
print(summary(X))
```
It is also possible to create a code from a single sequence that is decomposed into tuples by means of the `split` function. `split` is explained in more detail below.
## Random codes (`random_code`)
It is also possible to create random codes.
```{r}
Xr = random_code(size = 10, tsize = 4)
print(Xr)
```
## Persisting codes in a text file (`write_codes` and `read_codes`)
Codes can be written to and read from a file.
```{r}
write_codes(filename = "codes.txt", codes = list(X, Xr), header = "# My codes file")
```
This will create a file `codes.txt` with the following content:
```{r echo=F, comment = ""}
s = readLines("codes.txt")
cat(s[1], "\n", s[2], "\n", s[3])
```
```{r}
codes = read_codes("codes.txt")
print(codes)
```
# Manipulation and analysis of sequences, tuples and codons
## Tuples from a sequence (`split`, `tuples_usage`)
```{r}
print(seq)
t = gcatbase::split(seq, 3)
print(t)
```
Next we want to calculate the tuple usage. `tuples_usage` creates a data frame.
```{r}
u = tuples_freq(t)
print(head(u, 10))
```
Make a code from the sequence. Note that `code` makes the tuples unique:
```{r}
X2 = code(t, "Code from seq.")
print(X2)
```
`split` can also be used to create a code from a sequence.
```{r}
X3 = code(split("AUGUGAGC", tsize = 3), id = "Code from sequence")
print(X3)
```
Another argument for `split` is `sep`. It can only be used when `tsize` is ommited. The sequence is splitted according to the value. This is useful when codons (or tuples) are copied and pasted from a text.
```{r}
X5 = code(split("AUG,UUG,GCG", sep = ","), id = "Code from separator")
print(X5)
```
## Shift tuples (`shift`)
```{r}
X2s = code(shift(X2, 1), "shifted code")
print(X2s)
```
## Reverse and complementary tuples (`compl`, `rev_compl`)
The complementary tuples are for instance:
```{r}
print(t)
print(compl(t))
```
The anticodons of the tuples in `t` are:
```{r}
print(rev_compl(t))
```
## Classify sequence according to a code (`classify`, `code_usage`)
```{r}
h = classify(seq, X)
print(h)
```
From this we can derive the code usage $u$ (in %)
```{r}
u = length(h[h == 1]) / length(h)
print(u)
```
or use the convenient function `code_usage`:
```{r}
u = code_usage(seq, X)
print(u)
```
# Amino acids (`amino_acids`)
We can translate a vector of codons into amino acids:
```{r}
print(amino_acids(t))
```
Also, we can ask what amino acids are used by a code. Here the Vertebrate Mitochondrial Code (transl_table = 2) is used:
```{r}
print(amino_acids(X, numcode = 2)) # generic for amino_acids(X$tuples)
```
Furthermore, the data frame `aa_prop` contains some important chemical properties of amino acids as published in Haig, D. and Hurst, L.D. (1991).
```{r}
print(aa_prop)
```
The columns are according to the paper:
* `Polar` "Woese et al.'s (1966) polar requirement is the slope of the line that results when $$\log(1 -
RF)/RF$$ for free amino acids is plotted against the log mole fraction of water in pyridine solvent."
* `Hydropathy` "Kyte and Doolittle's (1982) hydropathy is based on water-vapor transfer free energies, the interior-exterior distribution of amino acid side-chains, and the subjective judgment of the authors"
* `Volume` "Grantham's (1974) molecular volume of side chains is the residue volume minus a constant peptide volume. "
* `Isoelectric` "The values of isoelectric points were taken from Alff-Steinberger (1969)."
# Operations over lists (`flatten`)
Sometimes we need to perform some operations over entries in a Fasta file. For instance, we want to count the tuples in a Fasta record. First, the file is read and converted into GCAT's default representation:
```{r}
fasta = seqinr::read.fasta("../inst/extdata/ena-sars.fasta", as.string = TRUE, forceDNAtolower = FALSE)
```
A typical question is the frequencies of all codons in the Fasta file. The following code calculates the frequencies:
```{r}
# Generic iteration
ntuples = sapply(fasta, function(f) {
seq = as.character(f)
gcatbase::split(seq) # individual operation
})
tuples = unlist(ntuples)
# Now the analysis:
freq = tuples_freq(tuples)
print(head(freq))
```
Except of the individual operation `gcatbase::split(seq)` the iteration is generic and can be made available as a pattern.
This pattern is implemented in GCAT's `flatten` function. The first paramter is the list (here the Fasta file) and the second the function (operator) to be applied (here the split function).
```{r}
codons = gcatbase::flatten(fasta, gcatbase::split)
freq = tuples_freq(codons)
print(head(freq))
```
Of course, this is also possible for quadrupels in frame 1.
```{r}
op = function(seq) {
seq1 = truncate(seq, 1) # Frame 1
gcatbase::split(seq1, 4) # Make quadruples.
}
tuples4 = gcatbase::flatten(fasta, op)
freq = tuples_freq(tuples4)
print(head(freq))
```
# Appendix
Also possible for a code and compatible to `rextendr`.
```{r eval=F}
X = Code$new(c("AUG"), id = "Foo")
X$tuples
X$tsize
X$alphabet
```
# References
Haig, D. and Hurst, L.D. (1991) ‘A quantitative measure of error minimization in the genetic code’, Journal of Molecular Evolution, 33(5), pp. 412–417. doi:10.1007/BF02103132.