\documentclass[a4paper]{article}
\usepackage{float}
\usepackage[margin=1.2in]{geometry}
\usepackage[english]{babel}
\usepackage[utf8]{inputenc}
\usepackage{amsmath}
\usepackage{subcaption}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage{listings}
\usepackage[colorinlistoftodos]{todonotes}
\title{Student Discourse Through the Years}
\author{Benji Lu, Kai Fukutaki, Ziqi Xiong}
\date{\today}
\begin{document}
\maketitle
\SweaveOpts{concordance=TRUE}
<<echo=FALSE>>=
source('dependencies.R')  # load the required packages
source('helpers.R')       # data-loading and topic-model helpers
source('graphing.R')      # plotting helpers (topicgraph, authorChart2)
articles = readData()
load('40 topics')         # restores the saved 40-topic LDA model as topic.model
articles = join.articles.with.topics(articles, topic.model, 3)
@
\section{Motivation}
They say that those who do not know history are doomed to repeat it. We hope to show how campus dialogue at the Claremont Colleges has changed over time, using articles from \textit{The Student Life} as representative of the discourse surrounding hot-button issues on campus. \textit{The Student Life}, or \textit{TSL}, is the student-run, weekly newspaper of the five Claremont Colleges (Pomona, Claremont McKenna, Scripps, Pitzer, and Harvey Mudd). It thus offers a good glimpse into the thoughts and discourse of the students on campus. The student population itself is in a constant state of flux, as students generally spend only four or five years on campus. This ever-changing population, and thus ever-changing set of \textit{TSL} writers (especially given that most writers probably do not work for \textit{TSL} their entire college careers), means that issues in campus dialogue can repeatedly resurface when current students are not aware of the related discussions that took place four or five years ago. Moreover, \textit{TSL} does not have a ``static'' format; that is, the number of articles and pages printed is not necessarily the same each issue. Rather, it varies depending on the amount of content gathered by the student staff. This yields interesting data on word-usage trends, topics covered, authorship, and more over time. How are topics evolving over time? Can we see major spikes in certain topics and phrases and trace those back to major events? Can we use these trends to inform our current dialogue? With better, more easily accessible information on \textit{TSL} history, future students will be able to see what kind of discussion and action has already taken place at the Claremont Colleges, better understand the history of intellectual conversations on our campus, and create fresher, deeper discussions in the future.
\section{Methodology}
\subsection{Data Source}
Our raw data was acquired directly from \textit{TSL}'s SQL database on the ASPC peninsula server. The data consists of three CSV tables:
\begin{enumerate}
\item \textbf{articles}: This table contains the information related to the \textit{TSL} articles, including ID, title, content, status (published or not), created date, published date, updated date, section ID, and issue ID.
\item \textbf{profiles}: This table contains the information related to the \textit{TSL} staff, including ID, name, and position.
\item \textbf{articles\_profiles}: This table contains the authorship of \textit{TSL} articles. Each row contains a \textit{TSL} staff member ID and an article ID.
\end{enumerate}
We joined the three tables on article\_id and profile\_id and filtered out all articles that were not published, had insufficient content (body text of fewer than 50 words), or were missing author or published-date information. We then selected the variables relevant to our analysis, including title, content, published date, and author name.
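For concreteness, here is a sketch of that join-and-filter step. It is not evaluated here, and the data frame and column names (\texttt{articles\_raw}, \texttt{status}, and so on) follow the table descriptions above but are illustrative rather than the exact ones in our scripts:
<<eval=FALSE>>=
library(dplyr)
library(stringr)
# Join articles to their authors, drop unusable rows, keep key variables.
articles <- articles_raw %>%
  inner_join(articles_profiles, by = c('id' = 'article_id')) %>%
  inner_join(profiles, by = c('profile_id' = 'id')) %>%
  filter(status == 'published',
         str_count(content, '\\S+') >= 50,   # at least 50 words of body text
         !is.na(name), !is.na(published_date)) %>%
  select(title, content, published_date, author_name = name)
@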
To access the data we used, visit our code repository at \href{https://github.com/ZiqiXiong/TopicModeling}{https://github.com/ZiqiXiong/TopicModeling}.
\subsection{Model Training}
Our goal was to build an unsupervised model that clusters the \textit{TSL} articles by their similarity in word usage so that each cluster can later be labeled and studied as a topic. With that goal in mind, we studied and used Latent Dirichlet Allocation (LDA), developed by David Blei, Andrew Ng, and Michael Jordan in 2003. LDA is a probabilistic model that explains sets of observations in terms of unobserved groups that account for why some parts of the data are similar. When applied to a collection of text documents, LDA views each document as a mixture of a small number of topics and each word as an occurrence generated by one of the document's topics. With inference techniques such as Gibbs sampling, LDA can learn the composition of each topic (its associated word probabilities) and the topic mixture of each document.\footnote{David M. Blei, Andrew Y. Ng, and Michael I. Jordan, ``Latent Dirichlet Allocation,'' \href{https://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf}{https://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf}}
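Concretely, in its standard smoothed form, LDA posits the following generative story: each topic $k$ has a word distribution $\phi_k \sim \mathrm{Dirichlet}(\beta)$, and each document $d$ has a topic mixture $\theta_d \sim \mathrm{Dirichlet}(\alpha)$; each word of document $d$ is then generated by drawing a topic assignment $z \sim \mathrm{Multinomial}(\theta_d)$ and drawing the word itself as $w \sim \mathrm{Multinomial}(\phi_z)$. Inference runs this story in reverse, estimating the $\phi_k$ and $\theta_d$ that best explain the observed documents.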
Before training our LDA model, we first processed the body text of \textit{TSL} articles by converting all words to lower case and reducing words to their stems, so that a group of words such as ``study,'' ``Study,'' and ``studies'' can be treated as one. We also removed punctuation, stop words (the most common short function words), and extremely rare words that would only add noise to model training. Finally, we transformed the collection of articles into a document-term matrix, with each row representing an article and each column representing a word.
Using the R package ``topicmodels,'' we trained multiple LDA models with different numbers of topics---25, 30, 35, 40, 45, and 50---so that we could adjust the generality of the discovered topics and select the most suitable model for our analysis.
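The pipeline looks roughly like the following sketch (not evaluated here; the cleaning thresholds are illustrative, and our actual scripts differ in detail):
<<eval=FALSE>>=
library(tm)           # text cleaning and document-term matrix
library(SnowballC)    # word stemming
library(topicmodels)  # LDA

corpus <- Corpus(VectorSource(articles$content))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords('english'))
corpus <- tm_map(corpus, stemDocument)

dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, 0.999)  # drop extremely rare terms

# Fit one model per candidate topic count
models <- lapply(c(25, 30, 35, 40, 45, 50),
                 function(k) LDA(dtm, k = k, method = 'Gibbs'))
@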
\subsection{Result and Visualization}
Our LDA model is capable of clustering key words into human-recognizable topics. Take five random topics discovered by our 40-topic model as an example:
<<>>=
topic.terms = get.topic.terms(topic.model, 20)  # top 20 terms for each topic
set.seed(47)
topic.terms[1:5, sample(40, 5)]  # top 5 terms of 5 randomly chosen topics
@
The five topics can be easily identified as ``music,'' ``sex,'' ``campus events,'' ``student protests,'' and ``sports games'' by their top keywords.
In addition to the distribution of words in each topic, the LDA model can also calculate the percentage of each topic in an article and thereby infer the overall subject of the article. Take five random \textit{TSL} articles as an example:
<<>>=
set.seed(37)
# The primary topic of five articles
sample.articles = articles[sample(5000,5),c(2,8)]
names(sample.articles) = c('article.title','primary.topic')
sample.articles
# What the topics are about
topic.terms[1:5, sample.articles[['primary.topic']]]
@
As we can see from the result above, ``Women's Basketball Struggles Through Tough First Games'' is categorized into topic 1, which is about team sports; ``Peace, Popular Opinion, and the Prisoner Exchange'' into topic 16, which is about national politics; ``Panel at CMC Debates Approach to Syria Conflict'' into topic 10, which is about student discussion; ``Students Put Out Fire in Scripps Dorm'' into topic 34, which is about residential life; and ``Ye Olde TSL: 1927, The First Year of the Sponsor Group'' into topic 27, which is about Pomona's sponsor program.
Our model can also identify the main topics in an article that is not already in our \textit{TSL} dataset. For example, it successfully categorizes a 108-word excerpt from Martin Luther King Jr.'s ``I Have a Dream'' speech into topic 16, which is about national politics, and topic 31, which is about student protest.
<<>>=
test.paragraph = "I have a dream that one day on the red hills of Georgia,
the sons of former slaves and the sons of former slave owners will be able to sit down
together at the table of brotherhood. I have a dream that one day even the state
of Mississippi, a state sweltering with the heat of injustice, sweltering with the heat
of oppression, will be transformed into an oasis of freedom and justice. I have a dream that
my four little children will one day live in a nation where they will not be judged by the
color of their skin but by the content of their character."
test.result = classifyResult(test.paragraph,topic.model)
test.result
topic.terms[1:5, test.result[['Topic.ID']]]
@
After categorizing each article into its related topics, we were able to plot the frequency with which a given topic appears over time by counting the number of articles about that topic printed in each issue of \textit{TSL}. For example, here is the plot of topic 3, which is about films:
<<fig=TRUE,height=10,width=15>>=
topicgraph(articles,3,2,14)
@
We can also visualize which topics each writer tends to write about. Take opinions columnist William Schumacher, for example:
<<fig=TRUE,height=10,width=15>>=
print(authorChart2(articles,'William Schumacher',2))
topic.terms[1:10,c(33,10)]
@
\section{Findings}
\subsection{Topic Analysis}
Here, we highlight some interesting findings that we discovered through our topic analysis, beginning with topic trends.
Two prominent movements in the recent history of the Claremont Colleges are the divestment campaign and the push for the unionization of Pomona's dining hall workers.\footnote{\href{http://www.pomona.edu/news/2013/09/25-divestment-decision}{pomona.edu/news/2013/09/25-divestment-decision}}\footnote{\href{http://www.dailykos.com/story/2014/5/4/1296249/Rethink-Divestment-5-1-2014-Pomona-College}{dailykos.com/story/2014/5/4/1296249/Rethink-Divestment-5-1-2014-Pomona-College}}\footnote{\href{http://www.latimes.com/opinion/op-ed/la-oe-morrison-gould-20141021-column.html}{latimes.com/opinion/op-ed/la-oe-morrison-gould-20141021-column.html}}\footnote{\href{http://www.nytimes.com/2012/02/02/us/after-workers-are-fired-an-immigration-debate-roils-california-campus.html?_r=0}{nytimes.com/2012/02/02/us/after-workers-are-fired-an-immigration-debate-roils-california-campus.html}}\footnote{\href{http://articles.latimes.com/2013/may/01/local/la-me-ln-college-workers-20130501}{articles.latimes.com/2013/may/01/local/la-me-ln-college-workers-20130501}} Using our model's visualizations of topics over time, we found the following trend in \textit{TSL}'s coverage of the two topics:
<<fig=TRUE,height=10,width=15>>=
topicgraph(articles,11,2,14)
@
The graph shows that \textit{TSL} coverage of the two topics was robust from the spring of 2009 through the spring of 2013, reaching its height between the fall of 2011 and the spring of 2013. Some interesting article headlines include ``NLRB Will Investigate Labor Practice Charges at Pomona Dining Halls'' at the peak in December 2011, ``WFJ Votes to Unionize'' at the very end of May 2013, ``Pomona Opts Not to Divest'' in October 2013, and ``Pitzer College to Divest From Fossil Fuel Funds'' around April 2014.
Unfortunately, our model wasn't able to separate the two topics, but the results are nonetheless fascinating. From our own perspective as juniors at Pomona College, the visualization is especially illuminating. We came to Pomona in the fall of 2013, right at the tail end of the peaks of the dining hall and divestment movements, as the visualization accurately shows. The semester before we arrived, the dining hall workers voted to unionize; the semester we arrived, President David Oxtoby announced Pomona's decision not to divest from fossil fuels; and the semester after we arrived, Pitzer College announced that it would divest from fossil fuels. Shortly after that, discussion of the two topics died down, both in \textit{TSL} and in general campus discourse. Regardless of one's opinions on these issues, they were exciting events in the colleges' history, and while we were able to witness the ultimate outcomes, we had virtually no knowledge of the related events and discussions that preceded them. With the visualizations our model provides, we're now better able to access key institutional memory, in the form of \textit{TSL} coverage of the topics, and understand these outcomes in a broader context that spans several years.
Another interesting topic that has been trending recently in \textit{TSL} coverage is sexual assault. Here's the graph that our model produced:
<<fig=TRUE,height=10,width=15>>=
topicgraph(articles,20,2,14)
@
The trend is quite dramatic: Sexual assault was rarely covered in \textit{TSL} before 2012, but it quickly gained prominence starting in 2013 and appears to have peaked in 2014. This is pretty consistent with national coverage of college sexual assault.\footnote{\href{http://america.aljazeera.com/watch/shows/america-tonight/articles/2014/11/10/timeline-collegesexualassault.html}{america.aljazeera.com/watch/shows/america-tonight/articles/2014/11/10/timeline-collegesexualassault.html}} The White House published its first report on college sexual assault in April 2014, Columbia University student Emma Sulkowicz began carrying a mattress around campus beginning in September 2014, California became the first state to institute an affirmative consent law in September 2014, and Title IX complaints filed against colleges increased from 11 in the 2009-2010 fiscal year to over 37 in the 2013-2014 fiscal year.\footnote{\href{http://www.nytimes.com/2014/05/04/us/fight-against-sex-crimes-holds-colleges-to-account.html}{nytimes.com/2014/05/04/us/fight-against-sex-crimes-holds-colleges-to-account.html}}
From some of the headlines in the visualization, we can see that the Claremont Colleges have undergone a multitude of changes in response to sexual assault beginning in 2013, including changes in party culture, the creation of task forces dedicated to sexual assault, the implementation of bystander training programs like Teal Dot, and the creation of full-time Title IX coordinator positions.
The final topic trend that we found particularly insightful is the student movements topic. Here's the trend that our model produced:
<<fig=TRUE,height=10,width=15>>=
topicgraph(articles,31,2,14)
@
Based on the graph, coverage of student movements, which tend to center on questions of power and campus dynamics, has been fairly consistent over time at the Claremont Colleges. We can see that in December 2012, the Occupy movement gained some coverage in \textit{TSL}. More recently, in December 2014, students reacted to events in Ferguson by marching through the colleges. The most obvious feature, though, is the spike in November 2015, when students protested recent race-related events centered at Claremont McKenna College.\footnote{\href{http://www.latimes.com/local/lanow/la-me-ln-claremont-marches-20151112-story.html}{latimes.com/local/lanow/la-me-ln-claremont-marches-20151112-story.html}}
Another neat feature of our model is its analysis and visualization of topic coverage by writer. Here's an example:
<<fig=TRUE,height=10,width=15>>=
print(authorChart2(articles,'Diane Lee',2))
topic.terms[1:10,39]
@
The graph shows the topics the writer has written the most about. In this case, we can see that Diane Lee, a news writer and eventually editor for \textit{TSL}, wrote a lot of pieces on topic 39, which is about administrative actions and decisions at the Claremont Colleges.
Here's another example:
<<fig=TRUE,height=10,width=15>>=
print(authorChart2(articles,'Sana Kadri',2))
topic.terms[1:10,5]
@
It seems like Sana Kadri wrote a lot on topic 5---internationalism and race---for \textit{TSL}. Our team for this project includes the opinions editor for \textit{TSL} this semester, who can confirm that Sana, who was an opinions columnist, was very passionate about those topics.
Finally, we can examine the topic content of articles written by the editorial board of \textit{TSL}. Each issue, the editorial board, which consists of the editor-in-chief and the two managing editors, writes a brief editorial that is published in the opinions section.\footnote{It should be noted that, since the \textit{TSL} staff changes every semester, the members of the editorial board are not necessarily the same from one semester to the next.} Often, it reflects on another article printed in the same issue. Here's the topic content analysis of the editorial board's pieces:
<<fig=TRUE,height=10,width=15>>=
print(authorChart2(articles,'Editorial Board',2))
topic.terms[1:10,10]
@
The chart indicates that the editorial board focuses primarily on topic 10, which is about on-campus student controversies. This makes sense: such controversies arise regularly and receive \textit{TSL} coverage, and they often generate substantial discussion on campus, making them relevant and appealing subjects for the editorial board to weigh in on.
\subsection{Miscellaneous Insights}
<<echo=FALSE>>=
articles <- readData()
@
There's a lot to be learned from the data beyond the topic content analysis. With a little bit of wrangling and visualization, we arrived at some findings that might be of value to the staff at \textit{TSL}. For example, it's no secret that the newspaper struggles to retain staff writers. But just how bad is it? By tracking the dates of each writer's first and last published articles, we were able to get a sense of how long each writer stayed with the paper. We excluded one-time writers for the opinions section, since those tend to be guest writers who are not part of the \textit{TSL} staff. Plotting the data, we got the following graph:
<<echo=FALSE>>=
# break down the published date into component parts, determine semester published
articles <- articles %>%
  mutate(yearPublished = year(published_date)) %>%
  mutate(monthPublished = month(published_date)) %>%
  mutate(dayPublished = day(published_date)) %>%
  mutate(semester = ifelse(monthPublished < 6, 'spring', 'fall'))
# calculate retention by semester for each writer
retention <- articles %>%
  group_by(profile_id) %>%
  mutate(maxYear = max(yearPublished),
         minYear = min(yearPublished),
         semesterOff = ifelse(maxYear == yearPublished & semester == 'fall',
                              'fall', 'spring'),
         semesterJoin = ifelse(minYear == yearPublished & semester == 'spring',
                               'spring', 'fall'),
         semestersTotal = ifelse(semesterOff == semesterJoin,
                                 (maxYear - minYear) * 2 + 1,
                                 (maxYear - minYear) * 2),
         totalPublished = n())
# keep a single row per writer
retention <- retention[!duplicated(retention[, 6]), ]
@
<<fig=TRUE>>=
retention %>%
  filter(totalPublished > 1 | section_id != 3) %>%
  group_by(semestersTotal) %>%
  summarise(n = n()) %>%
  ggplot(aes(x = semestersTotal)) +
  geom_line(aes(y = n)) +
  geom_point(aes(y = n)) +
  xlab('Number of Semesters Writing for TSL') +
  ylab('Number of Writers') +
  ggtitle('Retention Rate of TSL Writers') +
  scale_x_continuous(limits = c(1, 13), breaks = seq(0, 13, 2))
@
It seems like \textit{TSL}'s reputation for shedding writers isn't exaggerated. Out of the pool of writers over the past five years, 448 stayed for one semester or less; 108 stayed for two semesters; 40 for three; and 85 for four or more semesters. There are a few caveats, though. First, since the database only includes articles published within the last five years, some writers who were seniors when data collection began may have written for \textit{TSL} for many semesters but appear as one-semester writers because their earlier articles aren't in the database. This might lead to an overestimate of the number of writers who stay for only one or two semesters. Second, the method above reports the span between the first and last semesters in which a writer published in \textit{TSL}. Some writers, however, take a semester or more off from writing to study abroad or simply pursue other interests before returning to write again; in these cases, the number of semesters they spent writing for \textit{TSL} is overestimated. Lastly, there are a few points beyond 8, and even 10, semesters. Those represent a mix of people staying longer and alumni writing guest columns.
We can easily track the retention rate by section as well:
<<fig=TRUE>>=
retention %>%
  filter(section_id != 5) %>%
  filter(totalPublished > 1 | section_id != 3) %>%
  group_by(semestersTotal, section_id) %>%
  summarise(n = n()) %>%
  ggplot(aes(x = semestersTotal, y = n, color = factor(section_id))) +
  geom_point() +
  geom_line() +
  scale_color_discrete(name = "Section",
                       labels = c('News', 'Sports', 'Opinions', 'L&S')) +
  xlab('Number of Semesters Writing for TSL') +
  ylab('Number of Writers') +
  ggtitle('Retention Rate by Section') +
  scale_x_continuous(limits = c(1, 13), breaks = seq(0, 13, 2))
@
It seems like the news and life \& style sections suffer the largest drop-off in retention beyond one semester. At least for the news section, this makes some sense: news is often the most demanding section, since writers sometimes have to squeeze time into their schedules to cover time-sensitive news events on short notice.
Another area of interest might be how prolific writers for \textit{TSL} are---that is, how many articles each writer writes. It seems pretty reasonable to expect that trend to resemble the retention rates shown above. After all, a person can only write so many articles if they're only a writer for one semester. And indeed, we can see that this is the case:
<<echo=FALSE>>=
articles_per_writer <- articles %>%
  group_by(profile_id) %>%
  summarise(total = n())
@
<<fig=TRUE>>=
articles_per_writer %>%
  ggplot(aes(x = total)) +
  geom_histogram(binwidth = 1, col = I('gray')) +
  ggtitle('Histogram of Articles per Writer') +
  xlab('Number of Articles Written') +
  ylab('Number of Writers')
@
We can also break it down by section:
<<fig=TRUE>>=
# build one histogram per section, then arrange them in a grid
sections <- c('News', 'LS', 'Opinions', 'Sports')
for (num in 1:4) {
  section <- sections[num]
  assign(paste(section, 'section_articles_per_writer', sep = '_'),
         articles %>%
           filter(section_id == num) %>%
           group_by(profile_id) %>%
           summarise(total = n()) %>%
           ggplot(aes(x = total)) +
           geom_histogram(binwidth = 1, col = I('gray')) +
           ggtitle(section) +
           xlab('Number of Articles Written') +
           ylab('Number of Writers'))
}
grid.arrange(News_section_articles_per_writer, Opinions_section_articles_per_writer,
             LS_section_articles_per_writer, Sports_section_articles_per_writer,
             ncol = 2)
@
The opinions section has the fewest articles per writer---notice the scaling of the y-axis for the opinions plot. The news and life \& style sections seem to have the most articles per writer in general.
Finally, we were able to identify which articles published this semester have been especially popular by looking at the number of times each article was visited. Because data on the number of visitors to each article only began being collected in the fall semester of 2015, we had to limit our analysis to articles in that time period. Below are the top 10 most-viewed articles from the fall:
<<echo=FALSE>>=
clicks <- read.csv('data/articles.csv')
clicks <- clicks[, c(1, 4)]          # keep the article ID and view count
names(clicks) <- c('id', 'clicks')
# most-viewed articles (title, date published, writer, number of clicks)
(articles %>%
  filter(yearPublished == 2015 & semester == 'fall') %>%
  left_join(clicks, by = 'id') %>%
  arrange(desc(clicks)) %>%
  slice(1:10))[, c(2, 4, 7, 12)]
@
<<echo=FALSE>>=
# for convenience in later code, restrict clicks to fall 2015 articles
clicks <- articles %>%
  filter(yearPublished == 2015 & semester == 'fall') %>%
  left_join(clicks, by = 'id')
@
For those familiar with recent events on campus this semester, the list makes sense. For example, Lisette Espinosa emailed her opinions piece to Claremont McKenna College's former Dean of Students Mary Spellman, who gave a controversial reply that, some would argue, led to her resignation. Indeed, many of the top articles relate to issues of race, sexual assault, campus climate, and college image.
Similarly, we can identify the most-viewed writers on average:
<<echo=FALSE>>=
# identify the most-viewed writers on average
(clicks %>%
   filter(section_id != 5) %>%
   group_by(author_name) %>%
   summarise(average_views = mean(clicks), articles_published = n()) %>%
   arrange(desc(average_views)) %>%
   slice(1:10))
@
A lot of these are one-hit writers, which makes sense. The opinions section regularly invites guest writers to publish a piece in the paper. Typically, when this happens, the guest writer writes about a particularly controversial or timely subject; that is what motivates them to write, after all. As such, their pieces often get a lot of views. On the other hand, the other sections' pieces are written by regular staff writers, who have to cover the mundane along with the occasional big events each semester. Since the guest writers typically only write once about a hot-button issue, it makes sense that many of the most-viewed writers have only written once for the newspaper.
Nonetheless, we might be interested in seeing which regular contributors to the paper's content get the most views. Here, a ``regular contributor'' is defined as one who has written more than three articles for the paper:
<<echo=FALSE>>=
# the same ranking, restricted to the more consistent writers
(clicks %>%
   filter(section_id != 5) %>%
   group_by(author_name) %>%
   summarise(average_views = mean(clicks), articles_published = n()) %>%
   filter(articles_published > 3) %>%
   arrange(desc(average_views)) %>%
   slice(1:10))
@
We now move from examining individual view counts to examining the aggregate data. Here's a histogram of article views for articles published in fall 2015, excluding extreme outliers (articles with more than 1,000 views):
<<fig=TRUE>>=
clicks %>%
  ggplot() +
  geom_histogram(aes(x = clicks), binwidth = 10) +
  scale_x_continuous(limits = c(0, 1000)) +
  xlab('Number of Clicks') +
  ylab('Number of Articles') +
  ggtitle('Distribution of Clicks of TSL Fall 2015 Articles')
@
It seems like most articles are viewed fewer than 300 times, although there are a decent number that are viewed more than 750 times.
We can break the data down by section too, this time with boxplots. Again, we exclude outliers to focus on the bulk of the observations:
<<fig=TRUE, warning=FALSE>>=
# exclude outliers from view
clicks %>%
  filter(section_id != 5) %>%
  mutate(section = ifelse(section_id == 1, 'News',
                   ifelse(section_id == 2, 'Sports',
                   ifelse(section_id == 3, 'Opinions', 'L&S')))) %>%
  ggplot(aes(x = factor(section), y = clicks)) +
  geom_boxplot() +
  scale_y_continuous(limits = c(0, 1250)) +
  xlab('Section') +
  ylab('Cumulative Number of Clicks') +
  ggtitle('Distribution of Clicks by Section of TSL Fall 2015 Articles')
@
It seems like opinions pieces are viewed the most, though news pieces are not far behind. Articles in the sports and life \& style sections, though, are not viewed as much; the upper quartile for sports articles is below the lower quartile for opinions pieces, and the upper quartile for life \& style articles is not much higher. This makes sense given the atmosphere at the Claremont Colleges: students generally seem less interested in the topics covered in sports and life \& style pieces than in those covered in news and opinions pieces.
\section{Cumulative Product}
This project is not only about gaining insights into student discourse, but also about creating an infrastructural tool that allows other interested students to answer their own questions about our history. With that goal in mind, we created an interactive Shiny app that responds to users' queries about topics, authors, and articles, generating plots and analyses similar to those in this report. The app is online at \href{https://ziqixiong.shinyapps.io/TopicAnalysis}{https://ziqixiong.shinyapps.io/TopicAnalysis}.
\vspace{1cm}
\includegraphics[width=300px]{shinydemo.png}
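Structurally, the app is a thin layer over the same functions used throughout this report. Here is a minimal sketch of that structure (not evaluated here; the deployed app has more inputs and views, and the input names are illustrative):
<<eval=FALSE>>=
library(shiny)

# One topic-trend view driven by the topicgraph() helper from our scripts
ui <- fluidPage(
  titlePanel('TSL Topic Analysis'),
  sidebarLayout(
    sidebarPanel(
      numericInput('topic', 'Topic ID', value = 1, min = 1, max = 40)
    ),
    mainPanel(plotOutput('trend'))
  )
)

server <- function(input, output) {
  output$trend <- renderPlot(topicgraph(articles, input$topic, 2, 14))
}

shinyApp(ui, server)
@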
\section{Conclusion}
\subsection{Summary}
By fitting Latent Dirichlet Allocation models to the corpus of \textit{TSL} articles from the past five years, we were able to track how much coverage various subjects received in \textit{TSL} over time and thus form a rough idea of how these topics have trended. Moreover, combining the topic information with articles, authors, and time yields many other insights, such as the topic breakdown for each author and article (how much each author or article falls into particular topics). The models can also be applied to analyze the topic breakdowns of new articles or even entirely different documents. Through these features, we hope that our model can become a tool for future analysis of \textit{TSL} discourse, and thus student discourse, over time.
Additionally, from our brief analysis, we identified several interesting characteristics of \textit{TSL}. For example, we found that \textit{TSL} has low retention rates for writers and receives most of its views from opinions pieces, and from guest writers in particular. We suggest further examining these trends and potentially taking steps to address them. For example, \textit{TSL} might benefit from expanding its outreach to guest writers, especially those willing to contribute to campus dialogue on hot-button issues.
\subsection{Limitations}
Our model faces several limitations regarding both its analysis of topic coverage in \textit{TSL} and the inferences that can reasonably be drawn from its findings. Some of the topics generated during model creation are not very useful. For example, there are a few scattered ``Miscellaneous'' topics with no logical continuity among the articles they describe; the key words for these topics have little relation to one another. This is bound to happen sometimes: LDA is an unsupervised method, so we should not expect it to always find groups that make sense to us. For some numbers of topics, the model also generates a topic built around unusual punctuation or symbols that a few authors use, since these are counted as distinctive words. Sometimes topics are merged, as with ``Campus Activism'' combining both the divestment movement and the events surrounding Pomona workers' rights in 2012. On the whole, though, LDA produced sensible, comprehensible topics, ranging from ``Running'' to ``Fashion.'' In the 40-topic model, for instance, there was only one ``Miscellaneous'' topic and one ``NA'' topic (due to the unusual symbols used by a few authors), meaning 38 of 40 topics were useful---a very good proportion for an unsupervised method. As with all tools, it is useful so long as its shortcomings are kept in mind.
\subsection{Future Work}
The model could also be improved to give users a more accurate picture of how the prominence of topics in campus discourse has changed over time. Perhaps most obviously, the frequency with which a \textit{TSL} article is visited online over a certain period is likely a significant indicator of the prominence of the article's main topic during that period. Currently, the model does not include such information because the number of visitors was not tracked until this semester. As a result, certain topics are represented as more prominent in campus discourse than perhaps they truly were. For example, our model produces three or four topics that are all essentially athletics, and since the sports section covers sports events at the Claremont Colleges in every issue, the model's visualization shows a consistently moderate number of \textit{TSL} articles covering sports. This gives the impression that sports figure prominently in the collective campus consciousness, and without prior knowledge of our campus culture, it would be easy to draw that inference, when in fact athletics has a relatively modest place in campus dialogue. Visit counts would provide a different perspective: as shown in Section 3.2, sports articles generally had fewer online visitors than news and opinions articles this semester, and certain pieces related to race-based issues and Pomona branding received the most visitors. These data points are likely more consistent with the nature of campus discourse this past semester. If the model accounted for this type of information in its visualizations, it would present a more accurate picture of how topics trend over time.

The issue of topics that aren't useful to humans could be addressed by adding interactivity to the LDA training. Since LDA iterates until reaching a local optimum, it can be nudged out of that optimum by user input, such as ``this word cannot be in this topic.'' This would be more of a power-user feature than one for the average user, as re-running the topic building takes computational time.

Finally, if data on articles before 2009 could be obtained and converted into a usable format, it would extend our historical analysis considerably further back.
\end{document}