Corpus analysis: the document-term matrix
=========================================
_(C) 2014 Wouter van Atteveldt, license: [CC-BY-SA]_
The most important object in frequency-based text analysis is the *document term matrix*.
This matrix contains the documents in the rows and terms (words) in the columns,
and each cell is the frequency of that term in that document.
In R, these matrices are provided by the `tm` (text mining) package.
Although this package provides many functions for loading and manipulating these matrices,
using them directly is relatively complicated.
Fortunately, the `RTextTools` package provides an easy function to create a document-term matrix from a data frame. To create a document-term matrix from a simple data frame with a 'text' column, use the `create_matrix` function:
```{r,message=F}
library(RTextTools)
input = data.frame(text=c("Chickens are birds", "The bird eats"))
m = create_matrix(input$text, removeStopwords=F)
```
We can inspect the resulting matrix using the regular R functions:
```{r}
class(m)
dim(m)
```
So, `m` is a `DocumentTermMatrix`, which is derived from a `simple_triplet_matrix` as provided by the `slam` package.
Internally, document-term matrices are stored as a _sparse matrix_:
with real data we can easily have hundreds of thousands of rows and columns, while the vast majority of cells will be zero (most words don't occur in most documents).
Storing this as a regular matrix would waste a lot of memory.
In a sparse matrix, only the non-zero entries are stored, as 'simple triplets' of (document, term, frequency).
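We can inspect this triplet representation directly via the matrix's internal components (a brief illustration; the `i`, `j`, and `v` component names come from the `slam` package's `simple_triplet_matrix`):
```{r}
# row index (document), column index (term), and value (frequency) of each non-zero cell
data.frame(document=m$i, term=colnames(m)[m$j], frequency=m$v)
```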
As seen in the output of `dim`, our matrix has only 2 rows (documents) and 6 columns (unique words).
Since this is a rather small matrix, we can visualize it using `as.matrix`, which converts the 'sparse' matrix into a regular matrix:
```{r}
as.matrix(m)
```
Stemming and stop word removal
-----
So, we can see that each word is kept as is.
We can reduce the size of the matrix by dropping stop words and by stemming
(see the `create_matrix` documentation for the full range of preprocessing options):
```{r}
m = create_matrix(input$text, removeStopwords=T, stemWords=T, language='english')
dim(m)
as.matrix(m)
```
As you can see, the stop words (_the_ and _are_) are removed, while the singular and plural forms of _bird_ are joined together.
In RTextTools, the language for stemming and stop words can be given as a parameter, and the default is English.
Note that stemming works relatively well for English, but is less useful for more highly inflected languages such as Dutch or German.
An easy way to see the effect of the preprocessing is to look at the `colSums` of the matrix,
which give the total frequency of each term:
```{r}
colSums(as.matrix(m))
```
For Dutch, the result is less promising:
```{r}
text = c("De kip eet", "De kippen hebben gegeten")
m = create_matrix(text, removeStopwords=T, stemWords=T, language="dutch")
colSums(as.matrix(m))
```
As you can see, _de_ and _hebben_ are correctly recognized as stop words, but _gegeten_ and _kippen_ have a different stem than _eet_ and _kip_.
Loading and analysing a larger dataset
-----
Let's have a look at a more serious example.
The file `achmea.csv` contains 22 thousand customer reviews, of which around 5 thousand have been manually coded with sentiment.
This file can be downloaded from [github](https://raw.githubusercontent.com/vanatteveldt/learningr/master/achmea.csv).
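If the file is not yet in your working directory, one way to fetch it from within R is `download.file` (a non-evaluated sketch; adjust the destination path if needed):
```{r, eval=FALSE}
# download the reviews into the current working directory (only needed once)
download.file("https://raw.githubusercontent.com/vanatteveldt/learningr/master/achmea.csv",
              destfile="achmea.csv")
```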
```{r}
d = read.csv("achmea.csv")
colnames(d)
```
For this example, we will only be using the `CONTENT` and `SENTIMENT` columns.
We will create the document-term matrix from the `CONTENT` column, without stemming but with stop word removal, using `create_matrix`:
```{r}
m = create_matrix(d$CONTENT, removeStopwords=T, language="dutch")
dim(m)
```
Corpus analysis: word frequency
-----
What are the most frequent words in the corpus?
As shown above, we could use the built-in `colSums` function,
but this requires first casting the sparse matrix to a regular matrix,
which we want to avoid (even our relatively small dataset would have 400 million entries!).
So, we use the `col_sums` function from the `slam` package, which provides the same functionality for sparse matrices:
```{r}
library(slam)
freq = col_sums(m)
# sort the terms by descending frequency using the built-in order function:
freq = freq[order(-freq)]
head(freq, n=10)
```
As can be seen, the most frequent terms are all related to Achmea (unsurprisingly).
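Besides the top terms, we can read the overall vocabulary and corpus size directly off the frequency vector (a quick, purely illustrative check):
```{r}
length(freq)  # number of unique terms in the matrix
sum(freq)     # total number of term occurrences (tokens) after preprocessing
```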
It can be useful to compute different metrics per term, such as term frequency, document frequency (in how many documents a term occurs), and tf.idf (term frequency * inverse document frequency, which rewards terms that are frequent in a small number of documents and downweights terms that occur in most documents).
To make this easy, let's define a function `term.statistics` to compute this information from a document-term matrix (also available from the [corpustools](http://github.com/kasperwelbers/corpustools) package):
```{r, message=FALSE}
library(tm)
term.statistics <- function(dtm) {
  dtm = dtm[row_sums(dtm) > 0, col_sums(dtm) > 0]    # get rid of empty rows/columns
  vocabulary = colnames(dtm)
  data.frame(term = vocabulary,
             characters = nchar(vocabulary),               # term length in characters
             number = grepl("[0-9]", vocabulary),          # does the term contain a digit?
             nonalpha = grepl("\\W", vocabulary),          # does the term contain non-word characters?
             termfreq = col_sums(dtm),                     # total frequency of the term
             docfreq = col_sums(dtm > 0),                  # number of documents the term occurs in
             reldocfreq = col_sums(dtm > 0) / nDocs(dtm),  # proportion of documents the term occurs in
             # tf.idf: mean relative frequency within documents * log2 inverse document frequency
             tfidf = tapply(dtm$v/row_sums(dtm)[dtm$i], dtm$j, mean) * log2(nDocs(dtm)/col_sums(dtm > 0)))
}
terms = term.statistics(m)
head(terms)
```
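To make the `tfidf` column more concrete: for each term, it is the average of that term's frequency divided by the document length, taken over the documents in which the term occurs, multiplied by log2 of (number of documents / document frequency). As a purely illustrative check (not part of the original tutorial), we can recompute it for a single term; the result can differ marginally because documents that became empty after preprocessing are dropped inside `term.statistics`:
```{r}
j = match(as.character(terms$term[1]), colnames(m))   # column index of an arbitrary term
col = as.matrix(m[, j])[, 1]                          # frequency of this term in each document
mean(col[col > 0] / row_sums(m)[col > 0]) * log2(nDocs(m) / sum(col > 0))
```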
So, we can remove all terms containing numbers or non-alphanumeric characters, and sort the remaining terms by frequency:
```{r}
terms = terms[!terms$number & !terms$nonalpha, ]
terms = terms[order(-terms$termfreq), ]
head(terms)
```
This is still not a very useful list, as the top terms occur in too many documents to be informative. So, let's remove all words that occur in more than 10% of documents, and let's also remove all words that occur in less than 10 documents:
```{r}
terms = terms[terms$reldocfreq < .1 & terms$docfreq > 10, ]
nrow(terms)
head(terms)
```
This seems more useful. We now have 2316 terms left of the original 20 thousand.
To create a new document-term matrix with only these terms, index on the right columns:
```{r}
m_filtered = m[, colnames(m) %in% terms$term]
dim(m_filtered)
```
As a bonus, using the `wordcloud` package, we can visualize the top words as a word cloud:
```{r, warning=F}
library(RColorBrewer)
library(wordcloud)
pal <- brewer.pal(6, "YlGnBu")   # color palette for the word cloud
wordcloud(terms$term[1:100], terms$termfreq[1:100],
          scale=c(6,.5), min.freq=1, max.words=Inf, random.order=FALSE,
          rot.per=.15, colors=pal)
```
Comparing corpora
----
If we have two different corpora, we can see which words are more frequent in each corpus.
Let's create two document-term matrices, one containing all positive comments and one containing all negative comments.
```{r}
table(d$SENTIMENT)
pos = d$CONTENT[!is.na(d$SENTIMENT) & d$SENTIMENT == 1]
m_pos = create_matrix(pos, removeStopwords=T, language="dutch")
neg = d$CONTENT[!is.na(d$SENTIMENT) & d$SENTIMENT == -1]
m_neg = create_matrix(neg, removeStopwords=T, language="dutch")
```
So, which words are used in positive reviews? Let's make a small helper function to speed this up:
```{r}
wordfreqs = function(m) {freq = col_sums(m); freq[order(-freq)]}
head(wordfreqs(m_pos))
```
And what words are used in negative reviews?
```{r}
head(wordfreqs(m_neg))
```
For the positive reviews, the words make sense (_goed_, _snel_). The negative reviews contain more general terms, and the term _fbto_ actually occurs in both lists.
Can we check which words are more frequent in the negative reviews than in the positive?
We can define a function `compare.corpora` that makes this comparison by normalizing the term frequencies by corpus size, and then computing the 'overrepresentation' and the chi-squared statistic (also available from the [corpustools](http://github.com/kasperwelbers/corpustools) package).
```{r}
chi2 <- function(a, b, c, d) {
  # chi-squared statistic for the 2x2 table with cells a, b, c, d
  ooe <- function(o, e) {(o-e)*(o-e) / e}   # (observed - expected)^2 / expected
  tot = 0.0 + a + b + c + d
  a = as.numeric(a)                         # convert to double to avoid integer overflow
  b = as.numeric(b)
  c = as.numeric(c)
  d = as.numeric(d)
  (ooe(a, (a+c)*(a+b)/tot)
   + ooe(b, (b+d)*(a+b)/tot)
   + ooe(c, (a+c)*(c+d)/tot)
   + ooe(d, (d+b)*(c+d)/tot))
}
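# quick sanity check (an added illustration, not part of the original code):
# chi2 should match R's built-in chisq.test without continuity correction
chi2(10, 20, 90, 80)
chisq.test(matrix(c(10, 90, 20, 80), nrow=2), correct=FALSE)$statistic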
compare.corpora <- function(dtm.x, dtm.y, smooth=.001) {
  freqs = term.statistics(dtm.x)[, c("term", "termfreq")]
  freqs.rel = term.statistics(dtm.y)[, c("term", "termfreq")]
  f = merge(freqs, freqs.rel, all=T, by="term")        # suffix .x = first corpus, .y = second corpus
  f[is.na(f)] = 0                                      # terms absent from one corpus get frequency 0
  f$relfreq.x = f$termfreq.x / sum(freqs$termfreq)     # normalize frequencies by corpus size
  f$relfreq.y = f$termfreq.y / sum(freqs.rel$termfreq)
  f$over = (f$relfreq.x + smooth) / (f$relfreq.y + smooth)   # smoothed overrepresentation in corpus x
  f$chi = chi2(f$termfreq.x, f$termfreq.y, sum(f$termfreq.x) - f$termfreq.x, sum(f$termfreq.y) - f$termfreq.y)
  f
}
cmp = compare.corpora(m_pos, m_neg)
head(cmp)
```
As you can see, for each term the absolute and relative frequencies are given for both corpora. In this case, `x` is positive and `y` is negative.
The 'over' column shows the amount of overrepresentation: a high number indicates that the term is relatively more frequent in the x (positive) corpus. The 'chi' column is a measure of how surprising this difference is: a high number means that the term's frequency differs strongly between the two corpora.
Let's sort by overrepresentation (ascending, so the terms most overrepresented in the negative corpus come first):
```{r}
cmp = cmp[order(cmp$over), ]
head(cmp)
```
So, the most overrepresented words in the negative corpus are words like _risico_, _beter_, and _maanden_. Note that _beter_ is somewhat surprising: a sentiment word list would probably classify it as a positive word.
We can also sort by chi-squared, taking only the underrepresented (negative) words:
```{r}
neg = cmp[cmp$over < 1, ]
neg = neg[order(-neg$chi), ]
head(neg)
```
As you can see, the list is very comparable, but more frequent terms are generally favoured in the chi-squared approach since the chance of 'accidental' overrepresentation is smaller.
Let's make a word cloud of the most typical negative terms, weighted by their chi-squared values:
```{r, warning=F}
pal <- brewer.pal(6, "YlGnBu")   # color palette for the word cloud
wordcloud(neg$term[1:100], neg$chi[1:100],
          scale=c(6,.5), min.freq=1, max.words=Inf, random.order=FALSE,
          rot.per=.15, colors=pal)
```
And the positive terms:
```{r, warning=F}
pos = cmp[cmp$over > 1, ]
pos = pos[order(-pos$chi), ]
wordcloud(pos$term[1:100], pos$chi[1:100]^.5,
          scale=c(6,.5), min.freq=1, max.words=Inf, random.order=FALSE,
          rot.per=.15, colors=pal)
```