chapter6.Rmd

---
title: 'Subsets and conditions'
description: 'Let''s begin to seek the best conditions for coping with uncertainty.'
---

## Welcome to part 2!

```yaml
type: NormalExercise
key: da73cb2b07
lang: r
xp: 100
skills: 1
```

Welcome to chapter 6: **Subsets and conditions**. In the following chapters (6-10) we will be using a dataset collected from students in the Faculty of Social Sciences in the University of Helsinki. The students filled out [this questionaire](https://elomake.helsinki.fi/lomakkeet/54219/lomake.html), which produced the dataset described [here](http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS2-meta.txt).

The answers were later combined into the following combination variables:
  
- *deep*: mean score of questions related to deep learning
- *stra*: mean score of questions related to strategic learning
- *surf*: mean score of questions related to surface learning
- *attitude*: mean score of questions related to attitude toward statistics  

The *learning2014* dataset includes these variables along with the students age and gender. The data also includes the students points in a statistics exam. Zero points means that the student did not attend an exam. At the end of the course we will take a close look at how a positive attitude predicts students exam performance.

`@instructions`
- Look at the structure of learning2014
- Look at the first six observations of learning2014
- Compute a summary of all the variables in learning2014 (with a single function call)

`@hint`
- Use `head()` to look at the first six observations
- Use `summary()` on learning2014 to compute the summaries of the variables

`@pre_exercise_code`
```{r}
learning2014 <-  read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/learning2014.txt", sep = "\t", header = TRUE)

# Access ggplot2 functions
library(ggplot2)

# Exlude students who did not attend any exams (points == 0)
learning2014 <- learning2014[learning2014$points != 0, ]

# Create objects attitude and points
attitude <- learning2014$attitude
points <- learning2014$points

# A scatter plot of attitude and points
qplot(attitude, points) + geom_smooth(method = "lm")

```

`@sample_code`
```{r}
# learning2014 is available

# Look at the structure of learning2014
str(learning2014)

# Look at the first six observations of learning2014


# Compute a summary of the variables in learning2014


```

`@solution`
```{r}
# learning2014 is available

# Look at the structure of learning2014
str(learning2014)

# Look at the first six observations of learning2014
head(learning2014)

# Compute a summary of the variables in learning2014
summary(learning2014)


```

`@sct`
```{r}

# test_object("object_name")
test_function("str", args=c("object"))
test_function("head", args=c("x"))
test_function("summary", args=c("object"))

test_error()
success_msg("Good work!")
```

---

## Explore your data

```yaml
type: MultipleChoiceExercise
key: d64c2a9918
lang: r
xp: 50
skills: 1
```

Let's get a bit more familiar with the `learning2014` dataset. Beside are plots about the dataset. Browse through them to see the distributions of the variables. After looking at the plots, select the correct claim from the list below.

`@possible_answers`
- There are many outliers in the distribution of students scores on strategic learning.
- Over thirty students got three or less points.  
- There are more men than women in the dataset
- The distribution of deep learning is skewed to right (positive skewness).
- Highest frequency in the histogram of surface learning scores is little over 25.

`@hint`
- Browse through the plots and choose the correct claim.  
- [Skewness](https://en.wikipedia.org/wiki/Skewness)

`@pre_exercise_code`
```{r}
# data
learning2014 <-  read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/learning2014.txt", sep = "\t", header = TRUE)

# for resetting par
def.par <- par(no.readonly = TRUE)

# combination variables
two_plots<-function(x, color, plot_title, x_text){
  layout(matrix(c(1,2), nrow=2, ncol=1), heights = c(2,1))
  par(mar=c(0, 4,3,1), bty="n")
  hist(x, main = plot_title, breaks = 25, col = color, xlim=c(0,5), xaxt="n", xlab = "")
  par(mar=c(5,4,0,1))
  boxplot(x, horizontal = T, ylim=c(0,5),  xlab=x_text, col=color)
}

two_plots(learning2014$surf, 'salmon', "Distribution of students' scores on surface learning", 'Surface learning score')
two_plots(learning2014$deep, 'paleturquoise3', "Distribution of students' scores on deep learning", 'Deep learning score')
two_plots(learning2014$stra, 'mediumpurple2', "Distribution of students' scores on strategic learning", 'Strategic learning score')
two_plots(learning2014$attitude, 'seagreen3', "Distribution of students' scores on attitude towards statistics", 'Attitude score')
par(def.par)

# exam points
points <- table(cut(learning2014$points, (0:11)*3, include.lowest=T))
barplot(points, main = "Distribution of students' exam points", ylab='Frequency', xlab = "Points", col = "slategrey", border="slategrey")

# Age, Gender
layout(matrix(c(1,2), nrow=1, ncol=2), c(2,1))
boxplot(learning2014$age~learning2014$gender, main = "Box plot of students' age by gender", xlab = "age", horizontal = T, col=c("orange", "steelblue"))
barplot(table(learning2014$gender), main = "Distribution of students' gender", col=c("orange", "steelblue"), ylab="Frequency")

```

`@sct`
```{r}
msg1 = "Sorry, wrong. If you look at the boxplot of strategic learning, you can see that there are no outlier marks (like the one in deep learning) and the scores are close to each other."
msg2 = "Nope, in the distribution of students exam points you can see that only about 17 people got three or less points."
msg3 = "No, quite the opposite: 60 men and 120 women! "
msg4 = "This was a hard one. The distribution of deep learning is skewed to left. Try again!"
msg5 = "So true! Proceed!"

test_mc(correct = 5, feedback_msgs = c(msg1,msg2,msg3,msg4, msg5))
success_msg("Going through the plots like a pro. Awesome!")
```

---

## Selecting a subset()

```yaml
type: NormalExercise
key: c12d1b7017
lang: r
xp: 100
skills: 1
```

You have already learned how to look at your data with visualizations and summaries. This is great, but often there is a need to zoom in on a **subset** of the data matching some *condition*.

There are multiple ways to select a subset in R. Perhaps the most convenient method is the `subset()` function. `subset()` is a generic function that works with several data formats. The first argument of `subset()` is the data and the second argument are the *logical conditions* for selecting parts of the data.  

When used on a data.frame, `subset()` returns the rows of the data.frame which match the logical conditions. You will soon learn more about logical conditions.

`@instructions`
- Look at the first six observations of `learning2014`
- Create object `stra_learners` by executing the example codes
- Look at the first 6 rows of the strategic learners subdata
- Create your own subset by selecting the students with attitude less than 3

`@hint`
- Use the functions `head()` and `subset()` as in the example codes.
- The strategic learning variable is named `stra`. The attitude variable is named `attitude`.
- Use the operator `<` similarily to it's counterpart `>` in the example.

`@pre_exercise_code`
```{r}
learning2014 <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/learning2014.txt", sep = "\t", header = TRUE)

# Look at the first six observations
head(learning2014)

```

`@sample_code`
```{r}
# learning2014 is available

# Look at the first six observations
head(learning2014)

# select students with above (>) 4 strategic learning
stra_learners <- subset(learning2014, stra > 4)

# Look at the first 6 rows of stra_learners


# select students with below (<) 3 attitude
attitude_low <- 


```

`@solution`
```{r}
# learning2014 is available

# Look at the first six observations
head(learning2014)

# select students with above (>) 4 strategic learning
stra_learners <- subset(learning2014, stra > 4)

# Look at the first 6 rows of the subdata
head(stra_learners)

# select students with below (<) 3 attitude
attitude_low <- subset(learning2014, attitude < 3)


```

`@sct`
```{r}
# submission correctness tests

# example tests:
test_output_contains("head(stra_learners)")
test_object("attitude_low")

# test if the students code produces an error
test_error()

# Final message the student will see upon completing the exercise
success_msg("Excellent work!")

```

---

## Logical comparison

```yaml
type: NormalExercise
key: 73f94cfacf
lang: r
xp: 100
skills: 1
```

What you just saw inside the `subset()` function was  **logical comparison**. Logical comparison creates logical vectors, which can be used to select subsets of data. Logical conditions have other uses as well and are generally important in programming tasks.  

The logical comparison operators in R are:  

operator  | description
----------|------------
`==`      | exactly equal to
`!=`      | not equal to
`<`       | less than
`>`       | greater than
`<=`      | less or equal to
`>=`      | greater or equal to

Follow the instructions below to complete the exercise. Take your time, this might not be a quick exercise.

`@instructions`
- Execute and study the example codes
- Add to the row with `c("a","b","c")`: select a suitable logical comparison operator and a suitable comparison value to produce a result vector `FALSE, FALSE, TRUE`
- Add to the row with `c(1,3,2)`: select a suitable logical comparison operator and a suitable comparison value to produce a result vector `TRUE, FALSE, TRUE`
- Bonus: can you think of other ways of achieving the same results?

`@hint`
- The third example should point you in the right direction
- If you're uncertain, just choose a comparison value and try out a bunch of different operators to figure out what might work.

`@pre_exercise_code`
```{r}
# no pec
```

`@sample_code`
```{r}
# In the comments below: T = TRUE, F = FALSE
# (R also understands these abbreviations)

# Exactly equal
5 == 5 # TRUE

# Not equal
"cat" != "dog" # TRUE

# Greater or equal 3 times
c(0,1,2) >= 1 # F, T, T

# Use logical comparison to produce a result F, F, T
c("a","b","c")

# Use logical comparison to produce a result T, F, T
c(1, 3, 2)


```

`@solution`
```{r}
# In the comments below: T = TRUE, F = FALSE
# (R also understands these abbreviations)

# Exactly equal
5 == 5 # TRUE

# Not equal
"cat" != "dog" # -> TRUE

# Greater or equal 3 times
c(0,1,2) >= 1 # F,T,T

# Use logical comparison to produce a result F,F,T
c("a","b","c") == "c"

# Use logical comparison to produce a result T,F,T
c(1,3,2) < 3

```

`@sct`
```{r}
# submission correctness tests

# example tests:
test_output_contains("c('a','b','c')=='c'", incorrect_msg = "Please make the necessary adustments to produce the output FALSE FALSE TRUE")

test_output_contains("c(1,3,2)<3", incorrect_msg = "Please make the necessary adustments to produce the output TRUE FALSE TRUE")

# test if the students code produces an error
test_error()

# Final message the student will see upon completing the exercise
success_msg("Well done! Remember Spock's words: Logic is the beginning of wisdom, not the end.")

```

---

## Logical operators

```yaml
type: NormalExercise
key: d7b5a24d40
lang: r
xp: 100
skills: 1
```

Logical conditions can be combined with the following logical operators:
  
  operator                | description
----------------------- | -----------
  `!`a                    | NOT a
a `&` b                 | a AND b
a <code>&#124;</code> b | a OR b  
  
Logical operators work like logical conditions: they compare the elements of vectors one at a time and can produce logical vectors.

So, for example `F & T` evaluates to `FALSE` and `F | T` evaluates to `TRUE`. Also, `c(F, T) & c(T, T)` produces a logical vector `FALSE TRUE`.

It is also possible to use parenthesis to control the evaluation of the operators. See how this works from the example codes.

`@instructions`
- Create and print out four subsets of the learning2014 data matching the following conditions: 
- (gender is male, "M") AND (deep learning is greater than 4.5)
- (age is 25 OR age is 26) AND (attitude is greater than 3.5)
- (gender is female, "N") AND (strategic learning is greater than 4.5)
- (deep learning is more than 4) AND (points is zero OR points is greater than 30)

`@hint`
- The instructions show the correct use of parenthesis when using the logical operators and conditions
- Note that the logical equals is `==`, not `=`.

`@pre_exercise_code`
```{r}
learning2014 <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/learning2014.txt", sep = "\t", header = TRUE)

```

`@sample_code`
```{r}
# learning2014 is available

# male students who scored very high on deep learning
subset(learning2014, (gender == "M") & (deep > 4.5))

# 25 or 26 old students who scored high on attitude
subset(learning2014, (age == 25 | age == 26) & (attitude > 3.5))

# female students who scored very high on strategic learning


# students who scored high on deep learning for whom points is 0 or greater than 30


```

`@solution`
```{r}
# learning2014 is available

# male students who scored very high on deep learning
subset(learning2014, (gender == "M") & (deep > 4.5))

# 25 or 26 old students who scored high on attitude
subset(learning2014, (age == 25 | age == 26) & (attitude > 3.5))

# female students who scored very high on strategic learning
subset(learning2014, (gender == "N") & (stra > 4.5))

# students who scored high on deep learning for whom points is 0 or greater than 30
subset(learning2014, deep > 4 & (points == 0 | points > 30))


```

`@sct`
```{r}
# submission correctness tests

test_output_contains("subset(learning2014,(gender=='N')&(stra>4.5))", incorrect_msg="Please print out the subset containing the female students with high strategic learning scores")
test_output_contains("subset(learning2014,deep>4&(points==0|points>30))", incorrect_msg="Please print out the subset containing the students who scored high on deep learning and had either no points or very high points")

# test if the students code produces an error
test_error()

# Final message the student will see upon completing the exercise
success_msg("Excellent! You are the operator indeed.")

```

---

## Classical probability

```yaml
type: NormalExercise
key: 71e768aa4c
lang: r
xp: 100
skills: 1
```

Probability is the measure of the likelihood that an event will occur. It is quantified as a number between 0 and 1. Probability is an **essential** tool for statistics since the discipline is primarily interested in measuring uncertainty.

The classical definition of probability is the proportion of favourable cases over all possible outcomes. Thus, the probability of event $A$ with $f$ favourable cases out of $N$ possible outcomes is:
  
  $$P(A) = \frac{f}{N}$$
  
For the questions below assume that we randomly pick a single student from the learning2014 data.

`@instructions`
- Execute the sample codes that create a cross table of gender and attitude categories
- What is the probability that the student is female?
- What is the probability that the student has scored more than 3 on attitude?
- What is the probability that the student is male and has scored more than 4  on attitude?
- Save the probabilities to the objects p2, p3, p4 (instead of `NA`, not available)
- Combine all the probabilities to a vector with `c()` and print it out, rounded to 2 digits

`@hint`
- Use the table created and count the number of favourable cases (f) and the number of all possible cases (N)
- The asked probabilities are then f / N
- Use a comma to combine multiple values with `c()`.

`@pre_exercise_code`
```{r}
learning2014 <-  read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/learning2014.txt", sep = "\t", header = TRUE)

# transform attitude into a factor and cross table with gender
attitude <- cut(learning2014$attitude, 1:5, include.lowest = T)
gndr_att <- table(learning2014$gender, attitude)
print(addmargins(gndr_att))

```

`@sample_code`
```{r}
# learning2014 is available

# transform attitude into a factor and cross table with gender
attitude <- cut(learning2014$attitude, 1:5, include.lowest = T)
gndr_att <- table(learning2014$gender, attitude)
addmargins(gndr_att)

# P(gender = M)
p1 <- 61 / 183

# P(gender = N)
p2 <- NA
  
# P(attitude > 3)  
p3 <- NA
  
# P(gender = M, attitude > 4)
p4 <- NA
  
# Print out the probabilities
round(c(p1), digits = 2)


```

`@solution`
```{r}
# learning2014 is available

# transform attitude into a factor and cross table with gender
attitude <- cut(learning2014$attitude, 1:5, include.lowest = T)
gndr_att <- table(learning2014$gender, attitude)
addmargins(gndr_att)

# P(gender = M)
p1 <- 61 / 183

# P(gender = N)
p2 <- 122 / 183

# P(attitude > 3)  
p3 <- (85 + 19) / 183

# P(gender = M, attitude > 4)
p4 <- 12 / 183

# Print out the probabilities
round(c(p1, p2, p3, p4), digits = 2)


```

`@sct`
```{r}
# submission correctness tests

# example tests:
test_object("p2")
test_object("p3")
test_object("p4")

test_output_contains("round(c(p1, p2, p3, p4),2)")

# test if the students code produces an error
test_error()

# Final message the student will see upon completing the exercise
success_msg("Great! You are probably a genius!")


```

---

## Conditional probability

```yaml
type: NormalExercise
key: 562f9818e0
lang: r
xp: 100
skills: 1
```

As you saw before, conditioning is closely related to selecting a subset. The condition(s) defines the subset of the possible cases. This is also a good way to think about conditional probability: The condition defines the subset of possible outcomes.

Formally, conditional probability is defined by the Bayes formula

$$P(A | B) = \frac{P(A \text{ and } B)}{P(B)}$$
  
But we won't directly need to apply that definition here. 

For the questions below assume that we randomly pick a single student from the learning2014 data.

`@instructions`
- Print a table of attitude and gender by calling `addmargins()`
- What is the probability that the student has scored over 4 on attitude, on the condition that the student is male? Save the probability to `p1`
- What is the probability that the student has scored over 4 on attitude, on the condition that the student is female? Save the probability to `p2`.
- What is the probability that the student is male, on the condition that the student has scored over 4 on attitude? Save the probability to `p3`.
- What is the probability that the student is female, on the condition that the student has scored over 4 on attitude? Save the probability to `p4`.
- Combine all the probabilities to a vector with `c()` and print it out, rounded to 2 digits

`@hint`
- Remember that classical probability is the frequency of favourable outcomes over all possible outcomes Now, the condition determines the subset for the possible outcomes

`@pre_exercise_code`
```{r}
learning2014 <-  read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/learning2014.txt", sep = "\t", header = TRUE)
# transform attitude into a factor and cross table with gender
attitude <- cut(learning2014$attitude, 1:5, include.lowest = T)
gndr_att <- table(learning2014$gender, attitude)

```

`@sample_code`
```{r}
# Print me!
addmargins(gndr_att)

# P(attitude > 4 | gender = M)
p1 <- 12 / 61

# P(attitude > 4 | gender = N)
p2 <- NA

# P(gender = M | attitude > 4)
p3 <- NA

# P(gender = N | attitude > 4)
p4 <- NA

# Print out the probabilities
round(c(p1), digits = 2)

```

`@solution`
```{r}
# Print me!
addmargins(gndr_att)

# P(attitude > 4 | gender = M)
p1 <-  12 / 61

# P(attitude > 4 | gender = N)
p2 <-  7 / 122

# P(gender = M | attitude > 4)
p3 <- 12 / 19

## P(gender = N | attitude > 4)
p4 <- 7 / 19

# Print out the probabilities
round(c(p1, p2, p3, p4), digits = 2)

```

`@sct`
```{r}
# submission correctness tests

# example tests:
test_object("p1")
test_object("p2")
test_object("p3")
test_object("p4")
test_output_contains("round(c(p1, p2, p3, p4), 2)")

# test if the students code produces an error
test_error()

# Final message the student will see upon completing the exercise
success_msg("Wow!! You are awsome no matter the conditions!")

```