\documentclass{article}
\usepackage{mathtools}
\usepackage{longtable}
\title{Identification of Semantic Shifts in English Using Word Embeddings}
\author{Peter Schoener}
\date{\today}
\begin{document}
\maketitle
\begin{abstract}
In this paper, I attempt to identify words which have undergone significant semantic shifts over time. I do this using the deviation over time in the cosine similarities between each word and its most similar words.
\end{abstract}
\section{Introduction}
Semantic shift, the change in meaning of a word across domains or over time, is an important phenomenon to take into account when working with narrow corpora or individual texts. It can have a profound effect on the meaning of what is being read, leading to misinterpretation, or the shifted word may simply look out of place, in which case it is useful to analyze the meanings it may have taken in that particular context.
The ability to automatically flag drifting words and examine them more closely would also be helpful for understanding the causes, rate, and markers of shifts, which would help not only with language reconstruction, but also with the prediction and identification of emerging changes.
Perhaps most importantly, shifts must be taken into account when working with broad corpora. The distributional hypothesis, for example, is widely accepted, but when working with a corpus that extends far into the past and attempting to apply it to current language, one might find that the learnt embeddings of certain words are inaccurate for the present, being an average of their current and past meanings.
There is no natural or direct way to quantify the extent of drift, so my method uses a metric only internally and externally attempts only to rank words, producing a list of candidates which should be examined for shifts.
\section{Related Work}
This paper is a more or less direct extension of a paper by Kutuzov and Kuzmenko (c. 2016, not published in this form). That paper detects shifts by measuring how far a period-specific list of the embeddings most similar to the word under examination deviates from the corresponding list taken from an averaged model. By counting the number of overlapping words in the lists of the ten closest neighbors, they determine whether or not a substantial shift has occurred.
The experiment was originally planned to be more along the lines of Leeuwenberg et al., 2016, but after the domain was changed, only their methods for evaluating cosines were retained. The relative cosine idea did not, however, affect the usefulness of the final metric.
\section{Method}
This experiment used the British Parliament's Hansard Corpus, since it is large enough to have substantial amounts of data even for the earliest years. It totals 1.6 billion tokens, though it seems not all of these are indexed by date; only the indexed portion could be used, since the goal was to separate the models by year. The extracted raw text contains about 350 million tokens, with about 17,000 in the smallest year-specific set.
The text contains many abbreviations, so rather than tokenizing directly or naïvely, it made sense to use the NLTK sentence splitter followed by the NLTK word tokenizer. The text was normalized to lowercase due to irregular and/or frequent capitalization in certain parts of the record. The embeddings first used were Word2Vec with 50 dimensions, a rather low number but necessary due to computing-power constraints. The normal intuitions around embedding models still held, but in order to get more accurate results the dimensionality was later increased to 100.
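As a concrete sketch of this preprocessing, the pipeline might look roughly as follows. It assumes NLTK's \texttt{sent\_tokenize} and \texttt{word\_tokenize} and gensim's Word2Vec implementation; the training parameters other than the dimensionality and the file name are assumptions rather than a record of the actual code.
\begin{verbatim}
import nltk
from gensim.models import Word2Vec

def preprocess(raw_text):
    # Split into sentences first, since tokenizing naively would
    # stumble over the record's many abbreviations.
    sentences = []
    for sent in nltk.sent_tokenize(raw_text):
        # Lowercase to normalize the irregular capitalization.
        sentences.append([tok.lower() for tok in nltk.word_tokenize(sent)])
    return sentences

# Train a 100-dimensional Word2Vec model on the tokenized text.
sentences = preprocess(open("hansard.txt").read())
model = Word2Vec(sentences, vector_size=100, min_count=5)
\end{verbatim}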
Although it might have limited the issues caused by the size of the dataset, the text was not lemmatized, for two reasons. First, some words have a vastly different meaning in one particular form than in the rest, so merging the forms could confuse the algorithm with an overly long neighbor list. Second, this kind of shift is generally caused by one particular form drifting away from the others rather than by the whole group drifting together.
In order to account for rare tokens and sparse individual years, the embedding models were smoothed with a window of five years. This window was chosen so as to keep even fast drifts noticeable while making the embeddings dense enough to use. Only the words that appear in all windows could be considered by the drift metric, so it is important that rare words, being the most prone to drift, not be entirely filtered out.
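A minimal sketch of this smoothing, assuming the corpus has already been grouped into tokenized sentences per year (the function and variable names are hypothetical):
\begin{verbatim}
from gensim.models import Word2Vec

WINDOW = 5  # years per smoothing window

def train_windowed_models(sentences_by_year, first_year, last_year):
    # One model per sliding five-year window, so each embedding is
    # smoothed over the surrounding years.
    models = {}
    for start in range(first_year, last_year - WINDOW + 2):
        window_sentences = []
        for year in range(start, start + WINDOW):
            window_sentences.extend(sentences_by_year.get(year, []))
        models[start] = Word2Vec(window_sentences, vector_size=100,
                                 min_count=5)
    return models
\end{verbatim}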
The metric works by creating the set union of the nearest ten neighbor sets of a word over all windows. This is why it is important that the word be present in all windows; its absence would skew toward a smaller union, and moreover this would be an indicator that it is an uncommon token with imprecise embedding values. The metric is as follows:
$$ f(w) = \sum_{w' \in W} \sigma_y\left(\cos(w_y, w'_y)\right) $$
that is, the sum over all words $w'$ in the neighbor list union $W$ of the standard deviation over all years $y$ of the cosine similarity to that word. For a word not present in a given year, the cosines in that year will clearly all be zero, depressing the deviations and skewing the sum against marking drift as having occurred. However, given sufficient data, this metric has the advantage of marking both words which move relative to their neighbors and those which acquire new neighbors altogether.
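In code, the metric could be computed roughly as below, using the windowed gensim models sketched above. The gensim calls (\texttt{most\_similar}, \texttt{similarity}) are real; the zero-cosine handling of absent words follows the convention just described, and the function name is hypothetical.
\begin{verbatim}
import numpy as np

def drift(word, models, k=10):
    # Union of the ten nearest neighbors of `word` over all windows.
    union = set()
    for m in models.values():
        union.update(w for w, _ in m.wv.most_similar(word, topn=k))
    # Sum, over the union, of the standard deviation across windows
    # of the cosine between `word` and that neighbor.
    total = 0.0
    for neighbor in union:
        cosines = []
        for m in models.values():
            if neighbor in m.wv:
                cosines.append(m.wv.similarity(word, neighbor))
            else:
                cosines.append(0.0)  # absent word: cosine taken as zero
        total += np.std(cosines)
    return total

# Candidates are then ranked by the metric, highest first:
# ranked = sorted(candidates, key=lambda w: drift(w, models),
#                 reverse=True)
\end{verbatim}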
All words to which the metric can be applied are then ranked according to it, and the ranked words can be evaluated by hand in descending order. Since this paper is mostly about a method for identifying drift rather than specific instances of it, I did not exhaustively examine the drift of the words returned beyond what was necessary to check the validity of the ranking.
\section{Results}
Filtering for only words which occur in all windows, I arrived at a list of just under a thousand. While this is not an incredibly long list and does not include many uncommon words \textemdash\ which might have been subject to more severe drift \textemdash\ it does rank several common words which have drifted, as well as some more or less technical vocabulary which has markedly higher frequency and/or different meaning in the context of the corpus (legislative discussions).
The top 100 words according to the metric are as follows, listed in ascending order of drift value:
\begin{longtable}[c]{ll}
word & drift \\
\endhead
%
\multicolumn{1}{l|}{separate} & 499.00 \\ \hline
\multicolumn{1}{l|}{liverpool} & 499.22 \\ \hline
\multicolumn{1}{l|}{restriction} & 499.73 \\ \hline
\multicolumn{1}{l|}{private} & 500.12 \\ \hline
\multicolumn{1}{l|}{here} & 500.85 \\ \hline
\multicolumn{1}{l|}{actual} & 503.23 \\ \hline
\multicolumn{1}{l|}{natural} & 506.12 \\ \hline
\multicolumn{1}{l|}{thus} & 506.33 \\ \hline
\multicolumn{1}{l|}{sir} & 512.19 \\ \hline
\multicolumn{1}{l|}{relative} & 514.25 \\ \hline
\multicolumn{1}{l|}{district} & 521.20 \\ \hline
\multicolumn{1}{l|}{chairman} & 522.18 \\ \hline
\multicolumn{1}{l|}{light} & 522.73 \\ \hline
\multicolumn{1}{l|}{began} & 523.44 \\ \hline
\multicolumn{1}{l|}{continuance} & 523.90 \\ \hline
\multicolumn{1}{l|}{propriety} & 532.08 \\ \hline
\multicolumn{1}{l|}{directors} & 539.12 \\ \hline
\multicolumn{1}{l|}{temporary} & 540.50 \\ \hline
\multicolumn{1}{l|}{crisis} & 544.71 \\ \hline
\multicolumn{1}{l|}{respectable} & 546.91 \\ \hline
\multicolumn{1}{l|}{except} & 550.70 \\ \hline
\multicolumn{1}{l|}{afterwards} & 550.98 \\ \hline
\multicolumn{1}{l|}{none} & 555.18 \\ \hline
\multicolumn{1}{l|}{particulars} & 556.38 \\ \hline
\multicolumn{1}{l|}{discipline} & 556.41 \\ \hline
\multicolumn{1}{l|}{above} & 557.75 \\ \hline
\multicolumn{1}{l|}{)} & 558.83 \\ \hline
\multicolumn{1}{l|}{attendance} & 562.22 \\ \hline
\multicolumn{1}{l|}{regular} & 562.70 \\ \hline
\multicolumn{1}{l|}{disposition} & 564.47 \\ \hline
\multicolumn{1}{l|}{precisely} & 564.93 \\ \hline
\multicolumn{1}{l|}{parliamentary} & 565.15 \\ \hline
\multicolumn{1}{l|}{:} & 567.60 \\ \hline
\multicolumn{1}{l|}{forces} & 567.97 \\ \hline
\multicolumn{1}{l|}{?} & 569.46 \\ \hline
\multicolumn{1}{l|}{usual} & 575.83 \\ \hline
\multicolumn{1}{l|}{whenever} & 576.18 \\ \hline
\multicolumn{1}{l|}{lie} & 578.79 \\ \hline
\multicolumn{1}{l|}{exemption} & 582.68 \\ \hline
\multicolumn{1}{l|}{renewal} & 585.45 \\ \hline
\multicolumn{1}{l|}{arms} & 588.48 \\ \hline
\multicolumn{1}{l|}{materially} & 591.85 \\ \hline
\multicolumn{1}{l|}{young} & 594.89 \\ \hline
\multicolumn{1}{l|}{line} & 596.70 \\ \hline
\multicolumn{1}{l|}{latter} & 606.63 \\ \hline
\multicolumn{1}{l|}{censure} & 606.69 \\ \hline
\multicolumn{1}{l|}{lieutenant} & 609.40 \\ \hline
\multicolumn{1}{l|}{deficiency} & 619.78 \\ \hline
\multicolumn{1}{l|}{enemy} & 620.24 \\ \hline
\multicolumn{1}{l|}{notes} & 621.75 \\ \hline
\multicolumn{1}{l|}{fell} & 626.59 \\ \hline
\multicolumn{1}{l|}{ways} & 628.69 \\ \hline
\multicolumn{1}{l|}{king} & 634.82 \\ \hline
\multicolumn{1}{l|}{attending} & 638.10 \\ \hline
\multicolumn{1}{l|}{residence} & 640.48 \\ \hline
\multicolumn{1}{l|}{whereas} & 645.87 \\ \hline
\multicolumn{1}{l|}{mischief} & 646.62 \\ \hline
\multicolumn{1}{l|}{near} & 648.92 \\ \hline
\multicolumn{1}{l|}{mode} & 649.69 \\ \hline
\multicolumn{1}{l|}{w.} & 654.73 \\ \hline
\multicolumn{1}{l|}{principal} & 655.09 \\ \hline
\multicolumn{1}{l|}{corps} & 668.18 \\ \hline
\multicolumn{1}{l|}{rising} & 671.17 \\ \hline
\multicolumn{1}{l|}{directly} & 673.41 \\ \hline
\multicolumn{1}{l|}{clergy} & 684.38 \\ \hline
\multicolumn{1}{l|}{forth} & 686.01 \\ \hline
\multicolumn{1}{l|}{merchants} & 687.78 \\ \hline
\multicolumn{1}{l|}{mr.} & 690.19 \\ \hline
\multicolumn{1}{l|}{resumed} & 690.69 \\ \hline
\multicolumn{1}{l|}{voluntary} & 718.86 \\ \hline
\multicolumn{1}{l|}{(} & 720.04 \\ \hline
\multicolumn{1}{l|}{throne} & 721.10 \\ \hline
\multicolumn{1}{l|}{honourable} & 724.88 \\ \hline
\multicolumn{1}{l|}{dublin} & 728.00 \\ \hline
\multicolumn{1}{l|}{follows} & 728.56 \\ \hline
\multicolumn{1}{l|}{accordingly} & 741.85 \\ \hline
\multicolumn{1}{l|}{attorney} & 774.53 \\ \hline
\multicolumn{1}{l|}{notwithstanding} & 775.09 \\ \hline
\multicolumn{1}{l|}{field} & 784.19 \\ \hline
\multicolumn{1}{l|}{respecting} & 821.67 \\ \hline
\multicolumn{1}{l|}{lately} & 832.44 \\ \hline
\multicolumn{1}{l|}{francis} & 834.03 \\ \hline
\multicolumn{1}{l|}{mr} & 837.30 \\ \hline
\multicolumn{1}{l|}{regularly} & 840.65 \\ \hline
\multicolumn{1}{l|}{distinctly} & 851.35 \\ \hline
\multicolumn{1}{l|}{rose} & 887.50 \\ \hline
\multicolumn{1}{l|}{approbation} & 907.94 \\ \hline
\multicolumn{1}{l|}{c.} & 908.77 \\ \hline
\multicolumn{1}{l|}{volunteer} & 937.33 \\ \hline
\multicolumn{1}{l|}{!} & 965.13 \\ \hline
\multicolumn{1}{l|}{concurred} & 968.07 \\ \hline
\multicolumn{1}{l|}{ad} & 970.48 \\ \hline
\multicolumn{1}{l|}{\& } & 983.99 \\ \hline
\multicolumn{1}{l|}{commander} & 1017.29 \\ \hline
\multicolumn{1}{l|}{begged} & 1020.11 \\ \hline
\multicolumn{1}{l|}{contest} & 1030.57 \\ \hline
\multicolumn{1}{l|}{pursuant} & 1033.49 \\ \hline
\multicolumn{1}{l|}{tending} & 1039.51 \\ \hline
\multicolumn{1}{l|}{species} & 1048.04 \\ \hline
\multicolumn{1}{l|}{principally} & 1195.64 \\ \hline
\end{longtable}
It is worth noting that among the highest ranked words are some with multiple senses, such as ``tending,'' ``pursuant,'' ``contest,'' ``rose,'' etc. For many of these, the neighbor lists can be seen to have changed, shifting between the senses, but for some there just happens to be a large amount of noise. Although present, words relating to the senses of ``pursuant'' as in ``pursuing'' and ``in accordance with'' are not very prominent in its neighbor lists, with numbers, tokens garbled by encoding errors or typos, and seemingly unrelated words accounting for a large portion of the variation. This word must indeed have undergone some degree of shift, but the fact that it was picked up by the algorithm seems largely to have been a coincidence caused by the noise in the limited dataset.
Another category featuring prominently is punctuation and abbreviations, e.g.\ ``\&,'' ``!,'' and ``mr.''; these tokens are extremely common and appear in a variety of contexts, being noisy by nature rather than as a result of the size of the dataset.
Though there are some words in the top list that seem to have undergone drift, the list really does seem to consist mainly of words which still have two senses at present, of which one became more dominant over time. Counterexamples might include ``mischief,'' which is ranked very highly and has retained largely the same denotation, but with a considerably softened connotation.
One unintended consequence of the metric is that as a word comes into or falls out of use, a period of instability arises because an embedding cannot be accurately trained, leading to the marking of words that gain or lose relevance in a big way, such as ``propriety.''
However, on the other end of the scale, it can be seen that function words such as ``a,'' with a value of 325.01, ``no,'' with 195.71, and ``if,'' with 151.87, are among the lowest ranked by this metric. Clearly, function words such as these should be stable, and this is captured by the metric. That being said, there are some words at the bottom which intuitively seem to have undergone drift; ``would'' does not have the same sense today that it has in much archaic text, but it is assigned a value of only 136.61.
\section{Conclusion}
One immediately apparent property of the metric is that it assigns astronomically high values to the top few most drifted items, with the top percentile being marked twice as highly as the tenth. This is not in and of itself a problem, but it suggests that damping one or more of the factors, for example with a logarithm, could prevent runaway values caused by one outlying feature of a candidate.
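Purely as an illustration of this suggestion, and not something tested here, one such damped variant would apply a logarithm to each summand:
$$ f_{\log}(w) = \sum_{w' \in W} \log\left(1 + \sigma_y\left(\cos(w_y, w'_y)\right)\right) $$
so that no single neighbor's deviation can dominate the sum.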
The observation that many flagged words simply have two senses may be a result of the dataset going back no further than the beginning of the 19th century. Current English is entirely mutually intelligible with the English of that period; indeed, much of the most famous literature today is at least that old. If people can still understand that language, the old senses of common words are clearly still known, possibly leading to the perception that such a word has not changed meaning but has merely tipped in favor of one sense. Still, this is clearly a form of drift.
An obvious continuation of this experiment would be to rerun it on a larger dataset, possibly one with a more general domain, to fill in the gaps that limit the list of words that can be evaluated. As mentioned above, another expansion would be in time range; some people already have trouble understanding Early Modern English, so a corpus going back an additional three hundred years would be useful. Any earlier than that and the text would predate the rough standardization of English, leading to further difficulties; even a corpus of that breadth would be difficult to find or assemble, especially with enough data to produce meaningful embeddings. This might limit the set of words considered even further, since many would not appear toward one end of the time range or the other.
It would also be interesting to see how well the metric works on words that do not appear in all years, but it should be borne in mind that the current setup, especially evaluating the words on the metric, was already fairly computationally expensive, and allowing short gaps would vastly increase the space to be evaluated.
Also, it would be interesting to see the effects of using a lemmatizer during preprocessing. Many of the similar words captured were alternate forms of each other, and consolidating them would hopefully make the effects on them more visible. It would also free up space in the top ten for more words, which could greatly affect the final outcome since the new additions would be more distant and possibly more mobile.
This is not necessarily desirable; the current setup already captures dozens of neighbors for each word, obscuring the closer meanings and possibly skewing the metric with changes in cosine that do not actually represent changes in meaning. It may therefore be better to reduce the number of neighbors captured for the union, which would in turn shrink the union itself; less similar, and therefore less stable, neighbors would not be counted.
All in all, the metric is a qualified success; it marks words for which the dominance of senses shifts or which go in and out of fashion, while ignoring those that clearly do not change meaning. However, it does seem to have some trouble identifying clear drift, that is, a word losing one meaning and gaining another. The improvements outlined above would be a good starting point, but there is clearly far more work to be done in order to get a clear signal out of the noise.
\end{document}