---
title: "Star Trek The Next Generation Dataset"
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
This README explains how to access the data, gives a
[disclaimer](#disclaimer), and shows some [use cases](#usecases).
# TLDR
This dataset contains all episodes of Star Trek: The Next Generation (TNG), with a separate row
for every speech or description that I found in the scripts.
Install using `devtools::install_github("RMHogervorst/TNG")` or download the
compressed CSV file from the raw-data folder. Uncompressed, the file is approx. 95.2 MB.
You get the best results when you search with `grep`, because names are sometimes followed
or preceded by spaces or suffixes, for instance ` PICARD `, `PICARD`, or `PICARD V.O.`.
Licence: public domain, although the original scripts might not be.
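To illustrate the search tip, here is a minimal sketch in base R. The vector `who` is a made-up stand-in for the real column, so the counts say nothing about the actual dataset.

```r
# Toy stand-in for the `who` column; these values are made up for illustration.
who <- c("PICARD", " PICARD ", "PICARD V.O.", "RIKER", "DATA")

# Exact comparison only finds the clean spelling.
exact_hits <- who == "PICARD"

# grepl() finds every variant that contains the name.
pattern_hits <- grepl("PICARD", who)

# Trimming whitespace first helps when an exact match is really needed.
trimmed_hits <- trimws(who) == "PICARD"

sum(exact_hits)    # 1
sum(pattern_hits)  # 3
sum(trimmed_hits)  # 2
```

Note that the pattern match also picks up voice-over lines such as `PICARD V.O.`, which may or may not be what you want.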
# Short intro
This repo is an R package and dataset of the sci-fi series Star Trek: The Next Generation.
The dataset has 17 variables (columns) and 110176 rows. The variable names are:
```
[1] "episode" "productionnumber" "setnames" "characters"
[5] "act" "scenenumber" "scenedetails" "partnumber"
[9] "type" "who" "text" "speechdescription"
[13] "Released" "Episode" "imdbRating" "imdbID"
[17] "Season"
```
`episode` contains the name of the episode; `productionnumber`, `setnames`, and `characters`
were scraped from the top part of the script. All scripts are divided up into
parts, numbered by `partnumber`. A part can be a description or a speech (as indicated by the `type` variable).
Speech and descriptions that run over multiple lines
are joined together. `act`, `scenenumber`, and `partnumber` tell you what follows what and where
in the episode it happened.
*The variables from Released to the Season are imports from my [IMDB package](https://github.com/RMHogervorst/imdb).*
For example, somewhere in the episode New Ground a certain
grubby crew member confirms something...
```
all_episodes_TNG[65305,]
```
This row has episode New Ground, production number #40275-210, a bunch of sets, and the
following people in the cast:
PICARD,HELENA ROZHENKO,RIKER,ALEXANDER,DATA,MS. LOWRY,BEVERLY,ENSIGN FELTON,TROI,DOCTOR JA'DAR,GEORDI,WORF,Non-Speaking,SUPERNUMERARIES,SEVERAL BOYS,SEVERAL FATHERS,A SKULL-FACED ALIEN,WAITER
*As you can see, Non-Speaking is not really a cast member, but describes the people that follow.*
That happens when you scrape text.
```
   act scenenumber scenedetails partnumber   type  who  text speechdescription
1: ONE          6A                      95 speech WORF Good.             FALSE
```
And as you can see, WORF says "Good." in act one, scene 6a, partnumber 95.
There is no description of how Worf says this.
## Installation
Install using `devtools::install_github("RMHogervorst/TNG")` or download the
compressed CSV file from the raw-data folder. Uncompressed, the file is approx. 95.2 MB.
# Examples of explorations in this data set {#usecases}
Let's start with some basic explorations.
## number of speaking roles and ratings
How many people are speaking in an episode?
Since I'm using dplyr, the end result will be a tbl_df, which prints more nicely.
```{r loading packages}
suppressMessages(library(dplyr))
library(TNG)
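# Count distinct speakers per episode. .keep_all = TRUE keeps imdbRating,
# which current dplyr versions would otherwise drop in distinct().
TNG %>% group_by(episode) %>% distinct(who, .keep_all = TRUE) %>%
  summarize(n_people = n(), rating = mean(imdbRating)) %>%
  arrange(desc(n_people), desc(rating))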
```
What is the relation between rating and number of speaking people?
I will also add a bit of color for season.
```{r }
library(ggplot2)
TNG %>% group_by(episode) %>% distinct(who, .keep_all = TRUE) %>%
  summarize(n_people = n(), rating = mean(imdbRating), season = mean(Season)) %>%
  arrange(desc(n_people), desc(rating)) %>%
  # colour by season as a factor so each season gets its own discrete colour
  ggplot(aes(n_people, rating, colour = as.factor(season))) +
  geom_point(na.rm = TRUE) + labs(colour = "Season")
```
Most episodes cluster around the same point:
about 30 distinct speakers and a rating of about 7.5.
I'm intrigued by the lowest rating.
```{r}
# .keep_all = TRUE keeps imdbRating available after distinct()
TNG %>% group_by(episode) %>% distinct(who, .keep_all = TRUE) %>% summarize(rating = mean(imdbRating)) %>% arrange(rating)
```
It is the episode *Shades of Gray*.
According to [Wikipedia](https://en.wikipedia.org/wiki/Shades_of_Gray_%28Star_Trek:_The_Next_Generation%29):
> It was the only clip show filmed during the series and was created due to a lack of funds left over from other episodes during the season.
> "Shades of Gray" is widely regarded as the worst episode of the series, with critics calling it "god-awful" and a "travesty"; even Hurley referred to it negatively. It can be compared to "Spock's Brain" in The Original Series.
Right.
One character I found really annoying was Q.
In how many episodes does he actually appear? Let's look at the character list in the dataset.
Those episodes must be terrible.
```{r bloody Q}
TNG %>% group_by(episode) %>% filter(grepl(",Q,", characters)) %>%
summarize(rating = mean(imdbRating)) %>% knitr::kable(format = "html")
```
Well, they're not. They are among the best episodes of TNG.
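One caveat about the pattern `,Q,`: it only matches when Q has a comma on both sides, so an episode where Q is listed first or last in `characters` would be missed. Whether that actually happens in the data is something I have not checked. Below is a minimal sketch of a more forgiving pattern, using made-up cast strings rather than the real column.

```r
# Made-up cast strings in the same comma-separated style as `characters`.
characters <- c("PICARD,Q,RIKER", "Q,PICARD,DATA", "PICARD,WORF,Q", "PICARD,QUINN,TROI")

# The pattern used above needs a comma on both sides.
strict <- grepl(",Q,", characters)

# Alternative: allow Q at the start, in the middle, or at the end of the list.
anywhere <- grepl("(^|,)Q(,|$)", characters)

strict    # TRUE FALSE FALSE FALSE
anywhere  # TRUE  TRUE  TRUE FALSE
```

Both patterns skip `QUINN`, which is the reason for not simply searching for the letter `Q`. If names in the real data carry stray spaces, the pattern would need to allow for those too.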
## Descriptions
While I created this dataset I found that the descriptions in the scripts are very nice.
This is the first one:
> `r TNG$text[[1]]`
Which made me think, how many times is this description used? It feels as if
the scene is used very often.
```{r finding all the uss warp, tidy=TRUE}
TNG %>% filter(type == "description") %>%
filter(grepl("enterprise", text, ignore.case = TRUE) , grepl("warp speed", text, ignore.case = TRUE)) %>% select(text, Season) %>% knitr::kable(format = "html")
```
Not that often it seems.
## How often does Picard drink tea...
![picard drinking tea like a boss](https://cdn.shopify.com/s/files/1/0863/0220/products/picard-c_f2e7a43e-1028-4f4c-91d6-cd7c725e26f0_1024x1024.jpg?v=1455044042)
Found at: <https://www.heatherbuchanan.ca/products/captain-picard-tea-earl-grey-hot-greeting-card>
Picard seems to drink a lot of Earl Grey tea.
In fact, someone made a montage of [all the times he orders it](https://www.youtube.com/watch?v=R2IJdfxWtPM).
```{r }
TNG %>% filter(grepl("PICARD", who), grepl(" tea ", text)) %>% select(who, text, Season, act) %>% knitr::kable(format = "html")
```
That's weird. In the original scripts there is little to no mention of Earl Grey
tea. In fact, when I search for the exact phrase, it only occurs seven times.
```{r tea }
grep("Tea. Earl Grey. Hot", TNG$text, value = TRUE, ignore.case = TRUE)
```
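Part of the low count may come from the search patterns themselves: `" tea "` requires a space on both sides, so `tea?` or `tea.` is skipped, and in the exact phrase each `.` is a regex wildcard rather than a literal full stop. The sketch below shows the difference on made-up lines (not quotes from the dataset); how much it changes the real counts is untested.

```r
# Made-up lines for illustration; they are not quotes from the dataset.
text <- c("Tea. Earl Grey. Hot.", "Would you like some tea?",
          "A cup of tea please.", "The team is ready.")

# " tea " needs a space on both sides, so punctuation breaks the match.
spaced <- grepl(" tea ", text)

# \\b marks a word boundary; ignore.case also catches "Tea".
bounded <- grepl("\\btea\\b", text, ignore.case = TRUE, perl = TRUE)

# Escaping the dots makes them match literal full stops only.
exact <- grepl("Tea\\. Earl Grey\\. Hot", text, ignore.case = TRUE)

spaced   # FALSE FALSE  TRUE FALSE
bounded  #  TRUE  TRUE  TRUE FALSE
exact    #  TRUE FALSE FALSE FALSE
```

The word boundary keeps `team` out of the results while still allowing punctuation directly after `tea`.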
### disclaimer {#disclaimer}
I haven't checked everything and I had some errors during the construction,
so some scripts are not complete
and some parts are perhaps wrongly classified as speech or description.
The creation of the dataset took me 15 hours and linking it to the IMDB database
and creating this package took me another 4 hours.
### Resources
I've downloaded all the files from <http://www.st-minutiae.com/resources/scripts/>
And discovered that the scripts (mostly...) follow a convention of
- one tab for descriptions
- three tabs for what people say
- five tabs for who says things
- etc
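The tab convention above can be sketched as a small classifier in base R. This is an illustration under stated assumptions, not the code that built the dataset: the script lines are made up, and the names `script_lines`, `n_tabs`, and `line_type` are hypothetical.

```r
# Made-up script lines that follow the tab convention described above.
script_lines <- c(
  "\tThe bridge is quiet.",
  "\t\t\t\t\tPICARD",
  "\t\t\tMake it so.",
  "FADE OUT."
)

# Count the leading tabs on each line (0 when a line starts without a tab).
n_tabs <- attr(regexpr("^\t*", script_lines), "match.length")

# Map tab counts to a label; anything else is left as "other".
line_type <- ifelse(n_tabs == 1, "description",
             ifelse(n_tabs == 3, "speech",
             ifelse(n_tabs == 5, "who", "other")))

# Combine the counts, labels, and cleaned-up text in one table.
data.frame(n_tabs, line_type, text = trimws(script_lines))
```

Because the scripts only mostly follow the convention, a catch-all label such as `"other"` makes the exceptions easy to find and inspect by hand.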
I have used the packages dplyr and readr.
### Licence {#licence}
My dataset is CC0 (public domain).
I'm very curious to see your analyses of TNG.
Enjoy
Roel M. Hogervorst
2016-3-27