---
title: "Star Trek The Next Generation Dataset"
output: github_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

This README explains how to access the dataset, shows some
[use cases](#usecases), and ends with a [disclaimer](#disclaimer).

# TLDR
This dataset contains all episodes of Star Trek: The Next Generation (TNG), with a separate row
for every speech or description that I found in the scripts.
Install it using `devtools::install_github("RMHogervorst/TNG")` or download the
compressed csv file from the raw-data folder. Uncompressed, the file is approx. 95.2 MB.

You get the best results when you search with grep, because names are sometimes followed
or preceded by spaces or suffixes, for instance `  PICARD  `, `PICARD` or `PICARD V.O.`.
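As a small sketch of why grep works better than an exact comparison, here is a made-up vector of speaker labels (the name `who_demo` and its values are assumptions for illustration, not rows from the dataset):

```{r}
# Hypothetical speaker labels, mimicking the padding and suffixes in the scripts
who_demo <- c("  PICARD  ", "PICARD", "PICARD V.O.", "RIKER", " WORF")

# An exact comparison only finds the label without padding or suffix
exact_hits <- who_demo == "PICARD"

# grepl() finds every label that contains the name
grep_hits <- grepl("PICARD", who_demo)

# trimws() strips the padding if you want clean labels afterwards
clean_who <- trimws(who_demo)

sum(exact_hits)
sum(grep_hits)
clean_who
```

The same `grepl()` call can be used inside `dplyr::filter()` on the `who` column.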

Licence: public domain (CC0), although the original scripts might not be.

# Short intro
This repo is an R package and dataset of the sci-fi series Star Trek: The Next Generation.

The dataset has 17 variables/columns and 110176 rows. The variable names are:

```
[1] "episode"           "productionnumber"  "setnames"          "characters"       
[5] "act"               "scenenumber"       "scenedetails"      "partnumber"       
[9] "type"              "who"               "text"              "speechdescription"
[13] "Released"          "Episode"           "imdbRating"        "imdbID"           
[17] "Season"
```

`episode` contains the name of the episode. `productionnumber`, `setnames`, and `characters`
were scraped from the top part of the script. All scripts are divided up into
parts, numbered by `partnumber`. A part is either a description or a speech (as told by the `type` variable).
Speech and descriptions that run over multiple lines
are put together. `act`, `scenenumber` and `partnumber` tell you what follows what and where
in the episode it happened.

*The variables from `Released` to `Season` are imported from my [IMDB package](https://github.com/RMHogervorst/imdb).*

For example, somewhere in the episode New Ground a certain
grubby crewmember confirms something...

```
all_episodes_TNG[65305,]
```
This row has episode New Ground, production number #40275-210, a bunch of sets, and the
following people in the cast:

PICARD,HELENA ROZHENKO,RIKER,ALEXANDER,DATA,MS. LOWRY,BEVERLY,ENSIGN FELTON,TROI,DOCTOR JA'DAR,GEORDI,WORF,Non-Speaking,SUPERNUMERARIES,SEVERAL BOYS,SEVERAL FATHERS,A SKULL-FACED ALIEN,WAITER

*As you can see, Non-Speaking is not really a cast member, but describes the people listed after it.*
That happens when you scrape text.

```
   act scenenumber scenedetails partnumber   type   who   text speechdescription
1: ONE         6A                       95 speech  WORF  Good.             FALSE
```

And as you can see, WORF says "Good." in act one, scene 6A, partnumber 95.
There is no description of how Worf says this.
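The `characters` value shown above is one comma-separated string, so it needs splitting before you can count or filter cast members. A minimal sketch in base R, assuming that everything after the `Non-Speaking` marker is a non-speaking role (I have not checked that this holds for every episode; the variable names are made up for the example):

```{r}
# Cast list of the New Ground row, as shown above
cast_string <- paste0(
  "PICARD,HELENA ROZHENKO,RIKER,ALEXANDER,DATA,MS. LOWRY,BEVERLY,",
  "ENSIGN FELTON,TROI,DOCTOR JA'DAR,GEORDI,WORF,Non-Speaking,",
  "SUPERNUMERARIES,SEVERAL BOYS,SEVERAL FATHERS,A SKULL-FACED ALIEN,WAITER"
)

# Split on commas and trim stray whitespace
cast <- trimws(strsplit(cast_string, ",", fixed = TRUE)[[1]])

# Position of the marker (NA if the episode has none)
marker <- match("Non-Speaking", cast)

# Assumption: names before the marker speak, names after it do not
speaking <- if (is.na(marker)) cast else cast[seq_len(marker - 1)]
non_speaking <- if (is.na(marker)) character(0) else cast[-seq_len(marker)]

speaking
non_speaking
```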

## Installation
Install using `devtools::install_github("RMHogervorst/TNG")` or download the
compressed csv file from the raw-data folder. Uncompressed, the file is approx. 95.2 MB.


# Examples of explorations in this data set  {#usecases}

Let's start with some basic explorations.

## Number of speaking roles and ratings

How many people are speaking in an episode?

Since I'm using dplyr, the end result will be a tbl_df, which prints nicer.

```{r loading packages}
suppressMessages(library(dplyr))  
library(TNG)
# .keep_all = TRUE keeps imdbRating available after distinct()
TNG %>% group_by(episode) %>% distinct(who, .keep_all = TRUE) %>% 
        summarize(n_people = n(), rating = mean(imdbRating)) %>% 
        arrange(desc(n_people), desc(rating) ) 
```

What is the relation between rating and number of speaking people?
I will also add a bit of color for season.

```{r }
library(ggplot2)
# .keep_all = TRUE keeps imdbRating and Season available after distinct()
TNG %>% group_by(episode) %>% distinct(who, .keep_all = TRUE) %>% 
        summarize(n_people = n(), rating = mean(imdbRating), season = mean(Season)) %>% 
        ggplot(aes(n_people, rating, colour = as.factor(season))) + 
        geom_point(na.rm = TRUE) + labs(colour = "Season")
```

The episodes cluster around the same point: roughly 30 distinct speakers
and a rating of about 7.5.

I'm intrigued by the lowest rating.

```{r}
# The rating is the same for every row of an episode, so distinct() is not needed here
TNG %>% group_by(episode) %>% summarize(rating = mean(imdbRating)) %>% arrange(rating)
```

It is the episode *Shades of Gray*.

According to [Wikipedia](https://en.wikipedia.org/wiki/Shades_of_Gray_%28Star_Trek:_The_Next_Generation%29):

> It was the only clip show filmed during the series and was created due to a lack of funds left over from other episodes during the season.

>  "Shades of Gray" is widely regarded as the worst episode of the series, with critics calling it "god-awful" and a "travesty"; even Hurley referred to it negatively. It can be compared to "Spock's Brain" in The Original Series.

Right. 

One character I found really annoying was Q.

In how many episodes does he actually appear? Let's look at the character list in the dataset.
Those episodes must be terrible.

```{r bloody Q}
TNG %>% group_by(episode) %>% filter(grepl(",Q,", characters)) %>% 
        summarize(rating = mean(imdbRating)) %>% knitr::kable(format = "html")
```


Well, they're not. They are among the best episodes of TNG.
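One caveat with matching `",Q,"`: it needs a comma on both sides, so it would miss a cast list where Q is the first or last name. I have not checked whether that occurs in the dataset. A hedged sketch on made-up cast strings (`cast_demo` and its values are assumptions for illustration):

```{r}
# Hypothetical cast strings in the same comma-separated layout as `characters`
cast_demo <- c("PICARD,Q,RIKER", "Q,PICARD,DATA", "PICARD,WORF,Q", "PICARD,QUARK,RIKER")

# The pattern with a comma on both sides
both_commas <- grepl(",Q,", cast_demo)

# Let the start or end of the string stand in for a comma
any_position <- grepl("(^|,)Q(,|$)", cast_demo)

both_commas
any_position
```

Both patterns skip names that merely start with a Q, because the character after `Q` has to be a comma or the end of the string.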

## Descriptions 

While I created this dataset I found that the descriptions in the scripts are very nice.

This is the first one:

> `r TNG$text[[1]]`

Which made me think, how many times is this description used? It feels as if
the scene is used very often.

```{r finding all the uss warp, tidy=TRUE}
TNG %>% filter(type == "description") %>%
        filter(grepl("enterprise", text, ignore.case = TRUE),
               grepl("warp speed", text, ignore.case = TRUE)) %>%
        select(text, Season) %>% knitr::kable(format = "html")
```

Not that often it seems. 

## How often does Picard drink tea...

![picard drinking tea like a boss](https://cdn.shopify.com/s/files/1/0863/0220/products/picard-c_f2e7a43e-1028-4f4c-91d6-cd7c725e26f0_1024x1024.jpg?v=1455044042)
Found at: <https://www.heatherbuchanan.ca/products/captain-picard-tea-earl-grey-hot-greeting-card>

Picard seems to drink a lot of Earl Grey tea.

In fact, someone made a montage of [all the times he orders it](https://www.youtube.com/watch?v=R2IJdfxWtPM).

```{r }
TNG %>% filter(grepl("PICARD", who), grepl(" tea ", text)) %>% select(who, text, Season, act) %>% knitr::kable(format = "html")

```

That's weird. The original scripts hardly mention Earl Grey
tea. In fact, when I search for the exact phrase, it only occurs seven times.

```{r tea }
grep("Tea. Earl Grey. Hot", TNG$text, value = TRUE, ignore.case = TRUE)
```
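Part of the reason may be the search pattern. `" tea "` is case sensitive and needs a space on both sides, so it skips "Tea." and "tea," entirely. A small sketch on made-up lines (not taken from the dataset) compares it with a whole-word, case-insensitive pattern:

```{r}
# Made-up lines of dialogue for illustration
lines_demo <- c("Tea. Earl Grey. Hot.",
                "I would like some tea now.",
                "Tea, Earl Grey, hot.",
                "Set a course for Earth.",
                "The steam rises.")

# Needs a lowercase "tea" with a space before and after
spaces_only <- grepl(" tea ", lines_demo)

# \\b marks a word boundary, so punctuation is fine and "steam" is not matched
whole_word <- grepl("\\btea\\b", lines_demo, ignore.case = TRUE)

# In a regular expression "." matches any character, so commas match too
loose_phrase <- grepl("Tea. Earl Grey. Hot", lines_demo, ignore.case = TRUE)

sum(spaces_only)
sum(whole_word)
sum(loose_phrase)
```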


### Disclaimer {#disclaimer}
I haven't checked everything and I ran into some errors during construction,
so some scripts are not complete
and some parts may be wrongly classified as speech or description.

The creation of the dataset took me 15 hours and linking it to the IMDB database
and creating this package took me another 4 hours. 

### Resources 
I've downloaded all the files from <http://www.st-minutiae.com/resources/scripts/>

I discovered that the scripts (mostly...) follow a convention of:

- one tab for descriptions
- three tabs for what people say
- five tabs for who says things
- etc
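To illustrate the convention, here is a rough sketch that labels lines by their number of leading tabs. It is not the code I used to build the dataset. The example lines and the names `script_demo`, `n_tabs` and `line_type` are made up for illustration.

```{r}
# Made-up script lines; "\t" is a tab character
script_demo <- c("\tThe bridge is quiet.",
                 "\t\t\t\t\tWORF",
                 "\t\t\tGood.",
                 "FADE OUT.")

# Count leading tabs: total length minus the length without the leading tabs
n_tabs <- nchar(script_demo) - nchar(sub("^\t+", "", script_demo))

# Map the tab counts from the list above to a label
line_type <- ifelse(n_tabs == 1, "description",
             ifelse(n_tabs == 3, "speech",
             ifelse(n_tabs == 5, "who", "other")))

data.frame(n_tabs, line_type)
```

Lines with any other number of tabs end up as `"other"`, which is where the "mostly..." above comes in.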


I have used the packages dplyr and readr. 

### Licence  {#licence}


My dataset is released under CC0 (public domain).


I'm very curious to see your analyses of TNG.
Enjoy

Roel M. Hogervorst  

2016-3-27