S4.Rmd

# S4

```{r setup, include = FALSE}
source("common.R")

# Hide annoying output
setMethod <- function(...) invisible(methods::setMethod(...))
setGeneric <- function(...) invisible(methods::setGeneric(...))
```

Like S3, S4 implements functional OOP, but is much more rigorous and strict. There are three main differences between S3 and S4:

* S4 classes have formal definitions provided by a call to `setClass()`.
  An S4 class can have multiple parents (multiple inheritance).
  
* The fields of an S4 object are not attributes or named elements, but 
  instead are called __slots__ and are accessed with the special `@` operator.
  
* Methods are not defined with a naming convention, but are instead
  defined by a call to `setMethod()`. S4 generics can dispatch on multiple
  arguments (multiple dispatch).

A good overview of the motivation of S4 and its historical context can be found in @chambers-2014, <https://projecteuclid.org/download/pdfview_1/euclid.ss/1408368569>.

S4 is a rich system, and it's not possible to cover all of it in one chapter. Instead, we'll focus on what you need to know to read most S4 code, and write basic S4 components. Unfortunately there is not one good reference for S4 and as you move towards more advanced usage, you will need to piece together needed information by carefully reading the documentation and performing experiments. Some good places to start are:

* [Bioconductor course materials][bioc-courses], a list of all courses
  taught by Bioconductor, a big user of S4. One recent (2017) course by Martin
  Morgan and Hervé Pagès is [S4 classes and methods][bioc-s4-class].

* [S4 questions on stackoverflow][SO-Morgan] answered by Martin Morgan.

* [_Software for Data Analysis_][S4DA], a book by John Chambers.

All S4 related functions live in the methods package. This package is always available when you're running R interactively, but may not be available when running R in batch mode (i.e. from `Rscript`). For this reason, it's a good idea to call `library(methods)` whenever you use S4. This also signals to the reader that you'll be using the S4 object system.

```{r}
library(methods)
```

## Classes

Unlike S3, S4 classes have a formal definition. To define an S4 class, you must define three key properties:

* The class __name__. By convention, S4 class names use UpperCamelCase.

* A named character vector that describes the names and classes of the 
  __slots__ (fields). For example, a person might be represented by a character 
  name and a numeric  age: `c(name = "character", age = "numeric")`. The 
  pseudo-class "ANY" allows a slot to accept objects of any type. \index{slots}

* The name of a class (or classes) to inherit behaviour from, or in S4 
  terminology, the classes that it __contains__. 

Slots and contains can specify the names of S4 classes, S3 classes (if registered), and base types. We'll go into more detail about non-S4 classes at the end of the chapter, in [S4 and existing code].

To create a class, you call `setClass()`, supplying these three properties. Lets make this concrete with an example. Here we create two classes: a person with character `name` and numeric `age`, and an `Employee` that inherits slots and methods from `Person`, adding an additional `boss` slot that must be a `Person`. `setClass()` returns a low-level constructor function, which should be given the class name with a `.` prefix.

```{r, cache = FALSE}
.Person <- setClass("Person", 
  slots = c(
    name = "character", 
    age = "numeric"
  )
)
.Employee <- setClass("Employee", 
  contains = "Person", 
  slots = c(
    boss = "Person"
  )
)
```

`setClass()` has 10 other arguments, but they are all either deprecated or not recommended. If you have existing S4 code that uses them, I'd recommend carefully reading the documentation and upgrading to modern practice.

We can now use the constructor to create an object from that class:

```{r}
hadley <- .Person(name = "Hadley", age = 37)
hadley
```

It's also possible to create an instance using `new()` and the name of the class. This is not recommended because it introduces some ambiguity. What happens if there are two packages that both define the `Person` class?

```{r}
hadley2 <- new("Person", name = "Hadley", age = 37)
```

In most programming languages, class definition occurs at compile-time, and object construction occurs later, at run-time. In R, however, both definition and construction occur at run time. When you call `setClass()`, you are registering a class definition in a (hidden) global variable. As with all state-modifying functions you need to use `setClass()` with care. It's possible to create invalid objects if you redefine a class after already having instantiated an object:

```{r, error = TRUE}
.A <- setClass("A", slots = c(x = "numeric"))
a <- .A(x = 10)

.A <- setClass("A", slots = c(a_different_slot = "numeric"))
a
```

This isn't usually a problem, because you'll define a class once, then leave the definition alone. If you want to enforce a single class definition, you can "seal" it:

```{r, error = TRUE}
setClass("Sealed", sealed = TRUE)
setClass("Sealed")
```

### Slots

You can access the slots with `@` or `slot()`: `@` is equivalent to `$`, and `slot()` to `[[`. \index{subsetting!S4} \index{S4|subsetting}

```{r}
hadley@age
slot(hadley, "age")
```

You can list all available slots with `slotNames()`:

```{r}
slotNames(hadley)
```

Slots should be considered an internal implementation detail. That means:

* As a user, you should not reach into someone else's object with `@`, 
  but instead, look for a method that provides the information you want.

* As a developer, you should make sure that all public facing slots have
  their own accessor methods.

We'll come back how to implement accessors in [Accessors], once you've learned how S4 generics and methods work.

### Helper

The result of `setClass()` is a low-level constructor, which means that don't need to write one yourself. However, this default constructor has three drawbacks:

*   The constructor takes `...`, not individual named slots. This mean that 
    printing the function is not revealing, and autocomplete doesn't have the  
    data it needs to be helpful.
    
    ```{r}
    .Person
    ```

*   If you don't supply values for a slot, the constructor will automatically 
    supply a default value:

    ```{r}
    .Person()
    ```
    
    Here, you might prefer that `name` is required, or that `age` 
    defaults to `NA`.
    
*   While it's not possible to create an S4 object with the wrong slots or 
    slots of the wrong type:

    ```{r, error = TRUE}
    .Person(name = "Hadley", age = "thirty")
    .Person(name = "Hadley", sex = "male")
    ```
    
    It is possible to create slots with the wrong lengths, or otherwise
    invalid values:
    
    ```{r}
    .Person(name = "Hadley", age = c(37, 99))
    ```

Like with S3, we resolve these issues by writing a helper function.

```{r}
Person <- function(name, age = NULL, ...) {
  if (is.null(age)) {
    age <- rep(NA_real_, length(name))
  }
  
  stopifnot(length(name) == length(age))
  .Person(name = name, age = age)
}
```

This provides the behaviour that we want:

```{r, error = TRUE}
# Name is now required
Person()

# And name and age must have same length
Person("Hadley", age = c(30, 37))

# And if not supplied, age gets a default value of NA
Person("Hadley")
```

It is _possible_ to achieve the same effect by implementing an `initialize()` method, but the `initialize()` generic has a complicated contract and it is very hard to get all the details right.

To re-use checking code in a subclass, you can take advantage of a detail of the constructor: an unnamed argument is interpreted as predefined object from the parent class. For example, to define a constructor for the Employee class that reuses the Person helper, you first create a `Person()`, then pass that to the `.Employee` constructor.

```{r}
Employee <- function(name, age, boss) {
  person <- Person(name = name, age = age)
  .Employee(person, boss = boss)
}
```

As with S3, if the validity checking code is lengthy or expensive, you should pull it out into a separate function which the helper calls.

### Introspection

To determine what classes an object inherits from, use `is()`:

```{r}
is(hadley)
```

To test if an object inherits from a specific class, use the second argument of `is()`:

```{r}
is(hadley, "person")
```

If you are using a class provided by a package you can get help on it with `class?Person`.

### Exercises

1.  What happens if you define a new S4 class that doesn't "contain" an 
    existing class?  (Hint: read about virtual classes in `?setClass`.)

1.  Imagine you were going to reimplement ordered factors, dates, and 
    data frames in S4. Sketch out the `setClass()` calls that you would
    use to define the classes. What should they inherit from? What slots
    should they use?

## Generics and methods

The job of a generic is to perform method dispatch, i.e. find the method designed to handle the combination of classes passed to the generic. Here you'll learn how to define S4 generics and methods, then in the next section we'll explore precisely how S4 method dispatch works.

S4 generics have a similar structure to S3 generics, but are a little more formal. To create an new S4 generic, you call `setGeneric()` with a function that calls `standardGeneric()`. \index{S4!generics} \index{S4!methods} \index{generics!S4} \index{methods!S4}.

```{r}
setGeneric("myGeneric", function(x) standardGeneric("myGeneric"))
```

Note that it is bad practice to use `{` in the generic function. This triggers a special case that is more expensive, and generally best avoided.

Like `setClass()`, `setGeneric()` has many other arguments. There is only one that you need to know about: `signature`. This allows you to control the arguments that are used for method dispatch. If `signature` is not supplied, all arguments (apart from `...`) are used. It is occassionally useful to remove arguments from dispatch. This allows you to require that methods provide arguments like `verbose = TRUE` or `quiet = FALSE`, but they don't take part in dispatch.

A generic isn't useful without some methods, and in S4 you add methods with `setMethod()`. There are three important arguments: the name of the generic, the name of the class, and the method itself. 

```{r}
setMethod("myGeneric", "Person", function(x) {
  # method implementation
})
```

(Again `setMethod()` has other arguments, but you should never use them.)

### Show method

As with S3, the most commonly defined S4 method controls printing, but in S4 we use a different generic: `show()`.

When defining a method for an existing generic, you need to first determine the arguments. You can get those from the documentation or by looking at the formals of the generic:

```{r}
names(formals(getGeneric("show")))
```

Our show method needs to have a single argument `object`:

```{r}
setMethod("show", "Person", function(object) {
  cat(is(object)[[1]], "\n",
      "  Name: ", object@name, "\n",
      "  Age:  ", object@age, "\n",
      sep = ""
  )
})
hadley
```

More formally, the second argument to `setMethod()` is called the __signature__. In S4, unlike S3, the signature can include multiple arguments. This makes method dispatch in S4 substantially more complicated, but avoids having to implement double-dispatch as a special case. We'll talk more about multiple dispatch in the next section.

### Accessor methods

Slots are generally considered to be an internal implementation detail: they can change without warning and user code should avoid accessing them directly. Instead, all user-readble slots should get an __accessor__. If the slot is unique to the class, this can just be a function:

```{r}
person_name <- function(x) x@name
```

But typically, you will want to define a generic and provide a method for your class:

```{r}
setGeneric("name", function(x) standardGeneric("name"))
setMethod("name", "Person", function(x) x@name)

name(hadley)
```

If the slot is also writeable, you should provide an setter function. Typically this function will be more complicated than the getter because you'll need to check that the new value is valid, or you may need to modify other slots. Here we make sure that this functions only allows changing the values, not the length:

```{r}
`person_name<-` <- function(x, value) {
  stopifnot(length(x@name) == length(value))
  x@name <- value
  x
}
```

Again, you'll typically want to do this with a method:

```{r, error = TRUE}
setGeneric("name<-", function(x, value) standardGeneric("name<-"))
setMethod("name<-", "Person", function(x, value) {
  stopifnot(length(x@name) == length(value))
  x@name <- value
  x
})
  
name(hadley) <- "Hadley Wickham"
name(hadley)
```

### Coercion methods

To coerce S4 object from one class to another, use `as()`. One nice feature of S4 is that it provides default coercion methods for you:

```{r error = TRUE}
mary <- new("Person", name = "Mary", age = 34)
roger <- new("Employee", name = "Roger", age = 36, boss = mary)

as(roger, "Person")
```

The defaults are not always quite right. For example, what happens if we try and coerce a Person to an Employee? The coercion succeeds because the `boss` slot is "helpfully" filled in with a default object:

```{r, error = TRUE}
mary_employee <- as(mary, "Employee")
mary_employee@boss
```

We can override the default coercion to supply an informative error.

```{r, error = TRUE}
setAs("Person", "Employee", function(from) {
  stop("Can not coerce an Person to an Employee", call. = FALSE)
})
as(mary, "Employee")
```

### Introspection

To list all the methods that belong to a generic, or that are associated with a class, use `sloop::s4_methods_generic()` and `s4_methods_class()`:

```{r}
library(sloop)
s4_methods_generic("initialize")
s4_methods_class("Person")
```

If you're looking for the implementation of a specific method, you can use `selectMethod()`. You give it the name of the generic and the class (or classes) that it's called with:

```{r}
selectMethod("show", "Person")
```

If you're using a method defined in a package, the easiest way to get help on it is to construct a valid call, and then put `?` in front it. `?` will use the arguments to figure out which help file you need:

```{r, eval = FALSE}
?show(hadley)
```

### Exercises

1.  In the definition of the generic, why is it necessary to repeat the
    name of the generic twice?
 
1.  What's the difference between the generics generated by these two calls?
    
    ```{r, eval = FALSE}
    setGeneric("myGeneric", function(x) standardGeneric("myGeneric"))
    setGeneric("myGeneric", function(x) {
      standardGeneric("myGeneric")
    })
    ```
   
1.  What happens if you define a method with different argument names to
    the generic?

1.  What other ways can you find help for a method? Read `?"?"` and
    summarise the details.

## Method dispatch 

S4 dispatch is complicated because S4 has two important features:

* Multiple inheritance, i.e. a class can have multiple parents, 
* Multiple dispatch, i.e. a generic can use multiple arguments to pick a method. 

These features make S4 very powerful, but can also make it hard to understand which method will get selected for a given combination of inputs. 

To explain method dispatch, we'll start simple with single inheritance and single dispatch, and work our way up to the more complicated cases. To illustrate the ideas without getting bogged down in the details, we'll use an imaginary __class graph__ based on emoji:

```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/s4-emoji.png", dpi = 450)
```

Emoji give us very compact class names (just one symbol) that evoke the relationships between the classes. It should be straightforward to remember that `r emo::ji("stuck_out_tongue_winking_eye")` inherits from `r emo::ji("wink")` which inherits from `r emo::ji("no_mouth")`, and that `r emo::ji("sunglasses")` inherits from both `r emo::ji("dark_sunglasses")` and `r emo::ji("slightly_smiling_face")`

### Single dispatch

Let's start with the simplest case: a generic function that dispatches on a single class with a single parent. The method dispatch here is quite simple, and the same as S3, but this will serve to define the graphical conventions we'll use for the more complex cases.

```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/s4-single.png", dpi = 450)
```

There are two parts to this diagram:

* The top part, `f(...)`, defines the scope of the diagram. Here we have a 
  generic with one argument, and we're going to explore method dispatch for a
  class hierarchy that is three levels deep. We'll only ever look at a small
  fragment of the complete class graph. This keeps individual diagrams simple
  while helping you build intuition that you apply to more complex class 
  graphs.
  
* The bottom part is the __method graph__ and  displays all the possible methods 
  that could be defined. Methods that have been defined 
  (i.e. with `setMethod()`) have a grey background.

To find the method that gets called, you start with the class of the actual arguments, then follow the arrows until you find a method that exists. For example, if you called the function with an object of class `r emo::ji("wink")` you would follow the arrow right to find the method defined for the more general `r emo::ji("no_mouth")` class. If no method is found, method dispatch has failed and you get an error. For this reason, class graphs should usually have methods defined for all the terminal nodes, i.e. those on the far right. 

There are two pseudo-classes that you can define methods for. These are called pseudo-classes because they don't actually exist, but allow you to define useful behaviours. The first pseudo-class is "ANY". This matches any class, and plays the same role as the `default` pseudo-class in S3. For technical reasons that we'll get to later, the link to the "ANY" method is longer than the links between the other classes:

```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/s4-single-any.png", dpi = 450)
```

The second pseudo-class is "MISSING". If you define a method for this "class", it will match whenever the argument is missing. It's generally not useful for functions that take a single argument, but can be used for functions like `+` and `-` that behave differently depending on whether they have one or two arguments.

### Multiple inheritance

Things get more complicated when the class has multiple parents.

```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/s4-multiple.png", dpi = 450)
```

The basic process remains the same: you start from the actual class supplied to the generic, then follow the arrows until you find a defined method. The wrinkle is now that there are multiple arrows to follow, so you might find multiple methods. If that happens, you pick the method that is closest, i.e. requires travelling the fewest arrows. 

(The method graph is a powerful metaphor that helps you understand how method dispatch works. However, implementing method dispatch in this way would be rather inefficient so the actual approach that S4 uses is somewhat different. You can read the details in `?Methods_Details`)

What happens if methods are the same distance? For example, imagine we've defined methods for `r emo::ji("dark_sunglasses")` and `r emo::ji("slightly_smiling_face")`, and we call the generic with `r emo::ji("sunglasses")`. Note that there's no implementation for the `r emo::ji("no_mouth")` class, as indicated by the red double outline.

```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/s4-multiple-ambig.png", dpi = 450)
```

This is called an __ambiguous__ method, and in diagrams I'll illustrate it with a thick dotted border. When this happens in R, you'll get a warning, and one of the two methods is basically picked at random (it uses the method that comes first in the alphabet). When you discover ambiguity you should always resolve it by providing a more precise method:

```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/s4-multiple-ambig-2.png", dpi = 450)
```


The fallback "ANY" method still exists but the rules are little more complex. As indicated by the wavy dotted lines, the "ANY" method is always considered further away than a method for a real class. This means that it will never contribute to ambiguity.

```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/s4-multiple-any.png", dpi = 450)
```

It is hard to simultaneously prevent ambiguity, ensure that every terminal method has an implementation, and minimise the number of defined methods (in order to benefit from OOP). For example, of the six ways to define only two methods for this call, only one is free from problems. For this reason, I recommend using multiple inheritance with extreme care: you will need to carefully think about the method graph and plan accordingly.

```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/s4-multiple-all.png", dpi = 450)
```

### Multiple dispatch

Once you understand multiple inheritance, understanding multiple dispatch is straightforward. You follow multiple arrows in the same way as previously, but now each method is specified by two classes (separated by a comma).

```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/s4-single-single.png", dpi = 450)
```

I'm not going to show examples of dispatching on more than two arguments, but you can follow the basic principles to generate your own method graphs.

The main difference between multiple inheritance and multiple dispatch is that there are many more arrows to follow. The following diagram shows four defined methods which produce two ambiguous cases:

```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/s4-single-single-ambig.png", dpi = 450)
```

Multiple dispatch tends to be less tricky to work with than multiple inheritance because there are usually fewer terminal class combinations. In this example, there's only one. That means, at a minimum, you can define a single method and have default behaviour for all inputs.

### Multiple dispatch and multiple inheritance

Of course you can combine multiple dispatch with multiple inheritance:

```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/s4-single-multiple.png", dpi = 450)
```

A still more complicated case dispatches on two classes, both of which have multiple inheritance:

```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/s4-multiple-multiple.png", dpi = 450)
```

However, as the method graph gets more and more complicated it gets harder and harder to predict which actual method will get called given a combination of inputs, and it gets harder and harder to make sure that you haven't introduced ambiguity. I highly recommend avoiding the combination of the two. There are some techniques (like mixins) that allow you to tame this complexity, but I am not aware of a detailed treatment as applied to S4.

### Exercises

1.  Take the last example which shows multiple dispatch over two classes that
    use multiple inheritance. What happens if you define a method for all
    terminal classes? Why does method dispatch not save us much work here?

## S4 and existing code

Even when writing new S4 code, you'll still need to interact with existing S3 classes and functions, including existing S3 generics. This section describes how S4 classes, methods, and generics interact with existing code.

### Classes

In `slots` and `contains` you can use S4 classes, S3 classes, or the implicit class of a base type.  To use an S3 class, you must first register it with `setOldClass()`. You call this function once for each S3 class, giving it the class attribute. For example, the following definitions are already provided by base R:

```{r, eval = FALSE}
setOldClass("data.frame")
setOldClass(c("ordered", "factor"))
setOldClass(c("glm", "lm"))
```

Generally, these definitions should be provided by the creator of the S3 class. If you're trying to build an S4 class on top of a S3 class provided by a package, it is better to request that the package maintainer add this call to the package, rather than running it yourself. 

If an S4 object inherits from an S3 class or a base type, it will have a special virtual slot called `.Data`. This contains the underlying base type or S3 object: \indexc{.Data}

```{r}
RangedNumeric <- setClass(
  "RangedNumeric",
  contains = "numeric",
  slots = c(min = "numeric", max = "numeric")
)
rn <- RangedNumeric(1:10, min = 1, max = 10)
rn@min
rn@.Data
```

It is possible to define S3 methods for S4 generics, and S4 methods for S3 generics (provided you've called `setOldClass()`). However, it's more complicated than it might appear at first glance, so make sure you thoroughly read `?Methods_for_S3`.

### Generics

As well as creating a new generic from scratch (as shown in [generics and methods]), it's also possible to convert an existing function to a generic. 

```{r}
sides <- function(object) 0
setGeneric("sides")
```

In this case, the existing function becomes the default ("ANY") method:
 
```{r}
selectMethod("sides", "ANY")
```
 
Note that `setMethod()` will automatically call `setGeneric()` if the first argument isn't already a generic, enabling you to turn any existing function into an S4 generic. I think it is ok to convert an existing S3 generic to S4, but you should avoid converting regular functions because it makes code harder to use (and requires coordination if done by multiple packages).

### Exercises

[S4DA]: http://amzn.com/0387759352?tag=devtools-20
[SO-Morgan]: http://stackoverflow.com/search?tab=votes&q=user%3a547331%20%5bs4%5d%20is%3aanswe
[bioc-courses]: https://bioconductor.org/help/course-materials/
[bioc-s4-class]: https://bioconductor.org/help/course-materials/2017/Zurich/S4-classes-and-methods.html
[bioc-s4-overview]: https://bioconductor.org/packages/devel/bioc/vignettes/S4Vectors/inst/doc/S4QuickOverview.pdf