example/Tutorial.Rmd

---
title: "GCAT Base Tutorial"
author: Markus Gumbel
date: "`r format(Sys.time(), '%d. %B %Y')`"
params:
  devel: TRUE
output:
 html_document:
  toc: true
  number_sections: true 
---

_under construction_

```{r include=params$devel, eval=params$devel}
devtools::load_all(".")
```

```{r include=!params$devel, eval=!params$devel}
library(gcatbase)
```

# Sequence, tuple and codon representations

Note: a tuple or a codon is considered to be a special case of a nucleotide sequence.

## Creation of tuples from an alphabet (`all_tuples`)

The function `all_tuples` creates all tuples. It expects the tuple size `tsize` and an alphabet $\Sigma$.

(`rextendr` does not have parameter names, does it?)

```{r}
sigma = c("+", "-") # Alphabet of two letters
tuples = all_tuples(tsize = 2, alphabet = sigma)
print(tuples)
```

Codons can be created with the `all_tuples` this way. By default, the alphabet are DNA bases.

```{r}
codons = all_tuples(3)
print(codons)
```

## The alphabet of tuples or a sequence (`alphabet`)

`alphabet` determines for a sequence or a vector of sequences the underlying alphabet including its type.
The type `type` can be "RNA" or "DNA" or "unkown".

```{r}
alphabet1 = alphabet("AUGC") # one sequence
print(alphabet1$letters)
print(alphabet1$type)
print(alphabet1)
```

```{r}
alphabet2 = alphabet(c("ATG", "CTT", "GAG", "CCG")) # vector
print(alphabet2)
```

```{r}
sample.code <- gcatbase::code(c("ATG", "CTT", "GAG", "CCG"), id="sample.code")
alphabet2 = alphabet(sample.code) # vector
print(alphabet2)
```

## Normalization of nucleotide sequences (`normalize`)

There are many ways to represent nucleotide sequences and codons or tuples in general:

 * DNA or RNA?
 * letters in lower- or upper-case?
 * Sequences as string or as vector of characters?

 `gcatbase` offers functions to convert the representation. The default representation is DNA in upper-case.

`normalize` converts a sequence or a vector of sequences (e.g. codons) to a specified representation.
```{r}
rna_codons = normalize(codons, RNA = TRUE, lowercase = TRUE)
print(rna_codons)
```

Of course, we can only normalize a sequence as well:

```{r}
seq = "ATCATGCCCACGGCGCGGAGGATAGCGAGCAGGT"
seqrna = normalize(seq, RNA = TRUE) # Uppercase is default
print(seqrna)
```

# Codes

## Create codes (`code`)

Codes are created with the `code` function. If the tuples passed to `code` are not unique they are made unique.

```{r}
# Non unique tuple set:
X = code(tuples = c("ATG", "GCG", "ATG"), id = "A code")
```

This will return an object of type `gcat.code`. The attributes are:

 * `id`: a short identifier for this code.
 * `tuples`: unique vector of tuples.
 * `tsize`: tuple size (e.g. 3 if codons)

Note that the code is now unique. The elements are sorted.
```{r}
print(X)
```

In particular for codons the `summary` function is available.

```{r}
print(summary(X))
```

It is also possible to create a code from a single sequence that is decomposed into tuples by means of the `split` function. `split` is explained in more detail below.

## Random codes (`random_code`)

It is also possible to create random codes.

```{r}
Xr = random_code(size = 10, tsize = 4)
print(Xr)
```

## Persisting codes in a text file (`write_codes` and `read_codes`)

Codes can be written to and read from a file.

```{r}
write_codes(filename = "codes.txt", codes = list(X, Xr), header = "# My codes file")
```

This will create a file `codes.txt` with the following content:
```{r echo=F, comment = ""}
s = readLines("codes.txt")
cat(s[1], "\n", s[2], "\n", s[3])
```

```{r}
codes = read_codes("codes.txt")
print(codes)
```


# Manipulation and analysis of sequences, tuples and codons

## Tuples from a sequence (`split`, `tuples_usage`)

```{r}
print(seq)
t = gcatbase::split(seq, 3)
print(t)
```

Next we want to calculate the tuple usage. `tuples_usage` creates a data frame.
```{r}
u = tuples_freq(t)
print(head(u, 10))
```

Make a code from the sequence. Note that `code` makes the tuples unique:
```{r}
X2 = code(t, "Code from seq.")
print(X2)
```

`split` can also be used to create a code from a sequence.

```{r}
X3 = code(split("AUGUGAGC", tsize = 3), id = "Code from sequence")
print(X3)
```

Another argument for `split` is `sep`. It can only be used when `tsize` is ommited. The sequence is splitted according to the value. This is useful when codons (or tuples) are copied and pasted from a text.

```{r}
X5 = code(split("AUG,UUG,GCG", sep = ","), id = "Code from separator")
print(X5)
```

## Shift tuples (`shift`)

```{r}
X2s = code(shift(X2, 1), "shifted code")
print(X2s)
```

## Reverse and complementary tuples (`compl`, `rev_compl`)

The complementary tuples are for instance:
```{r}
print(t)
print(compl(t))
```

The anticodons of the tuples in `t` are:

```{r}
print(rev_compl(t))
```

## Classify sequence according to a code (`classify`, `code_usage`)

```{r}
h = classify(seq, X)
print(h)
```


From this we can derive the code usage $u$ (in %)
```{r}
u = length(h[h == 1]) / length(h)
print(u)
```

or use the convenient function `code_usage`:

```{r}
u = code_usage(seq, X)
print(u)
```

# Amino acids (`amino_acids`)

We can translate a vector of codons into amino acids:
```{r}
print(amino_acids(t))
```

Also, we can ask what amino acids are used by a code. Here the Vertebrate Mitochondrial Code (transl_table = 2) is used:
```{r}
print(amino_acids(X, numcode = 2)) # generic for amino_acids(X$tuples)
```

Furthermore, the data frame `aa_prop` contains some important chemical properties of amino acids as published in Haig, D. and Hurst, L.D. (1991). 
```{r}
print(aa_prop)
```

The columns are according to the paper:

 * `Polar` "Woese et al.'s (1966) polar requirement is the slope of the line that results when $$\log(1 -
RF)/RF$$ for free amino acids is plotted against the log mole fraction of water in pyridine solvent."
 * `Hydropathy` "Kyte and Doolittle's (1982) hydropathy is based on water-vapor transfer free energies, the interior-exterior distribution of amino acid side-chains, and the subjective judgment of the authors"
 * `Volume` "Grantham's (1974) molecular volume of side chains is the residue volume minus a constant peptide volume. "
 * `Isoelectric` "The values of isoelectric points were taken from Alff-Steinberger (1969)."

# Operations over lists (`flatten`)

Sometimes we need to perform some operations over entries in a Fasta file. For instance, we want to count the tuples in a Fasta record. First, the file is read and converted into GCAT's default representation:
```{r}
fasta = seqinr::read.fasta("../inst/extdata/ena-sars.fasta", as.string = TRUE, forceDNAtolower = FALSE)
```

A typical question is the frequencies of all codons in the Fasta file. The following code calculates the frequencies:

```{r}
# Generic iteration
ntuples = sapply(fasta, function(f) {
  seq = as.character(f)
  gcatbase::split(seq) # individual operation
})
tuples = unlist(ntuples)

# Now the analysis:
freq = tuples_freq(tuples)
print(head(freq))
```

Except of the individual operation `gcatbase::split(seq)` the iteration is generic and can be made available as a pattern.

This pattern is implemented in GCAT's `flatten` function. The first paramter is the list (here the Fasta file) and the second the function (operator) to be applied (here the split function).

```{r}
codons = gcatbase::flatten(fasta, gcatbase::split)
freq = tuples_freq(codons)
print(head(freq))
```

Of course, this is also possible for quadrupels in frame 1.
```{r}
op = function(seq) {
  seq1 = truncate(seq, 1) # Frame 1
  gcatbase::split(seq1, 4) # Make quadruples.
}

tuples4 = gcatbase::flatten(fasta, op)
freq = tuples_freq(tuples4)
print(head(freq))
```

# Appendix

Also possible for a code and compatible to `rextendr`.
```{r eval=F}
X = Code$new(c("AUG"), id = "Foo")
X$tuples
X$tsize
X$alphabet
```

# References

Haig, D. and Hurst, L.D. (1991) ‘A quantitative measure of error minimization in the genetic code’, Journal of Molecular Evolution, 33(5), pp. 412–417. doi:10.1007/BF02103132.