<head>
<title>fstrain - Software for training finite-state models</title>
</head>
<body>
<h1>fstrain</h1>
<h2>Introduction</h2>
<p>Fstrain is a toolkit for training finite-state models from data. It
is an extension of OpenFst and implements the expectation semiring to
represent and manipulate finite-state machines with (log-linear)
features. It uses R for parameter optimization and can handle
potentially divergent objective functions.</p>
<p>Fstrain can be used to train globally normalized sequence models,
e.g. for part-of-speech (POS) tagging or named-entity recognition
(NER); string-to-string transduction models that may include
deletions and insertions (e.g. for lemmatization or spelling
correction); or simple maxent classifiers.
</p>
<p>It allows composing several smaller models and training them
jointly, e.g. for a factorial CRF.
</p>
<h2>Build Instructions</h2>
To build the code, please follow the instructions in the
<code>INSTALL</code> file.
<h2>Fstrain as C++ library</h2>
<p>Fstrain can be used as a C++ library. It is actually a collection
of four libraries:
<ul>
<li><em>fstrain-core</em> - This library contains the
expectation semiring (<code>expectation-weight.h</code>) and a
class to represent and compute with negative or positive
numbers in log space
(<code>neg-log-of-signed-num.h</code>). </li>
<li><em>fstrain-create</em> - This library contains code to
create finite-state topologies and place features on the
transition arcs, whose weights can then be trained.</li>
<li><em>fstrain-train</em> - This library contains code for
training of feature weights, different objective functions
(which compute gradients and function value for a given
parameter vector), code to update the weights in a
finite-state machine, etc.</li>
<li><em>fstrain-util</em> - This library contains miscellaneous
code: to check whether the cycles in an FST numerically converge
(<code>check-convergence.h</code>), to approximately determinize
an FST (<code>approx-determinize.h</code>), to convert a string
to a one-line FST, etc.</li>
</ul>
</p>
<p>For example, just using the <em>fstrain-core</em> library, you
can write C++ code to create or load FSTs with features, using the
expectation semiring:</p>
<pre>
#include &lt;cmath&gt;
#include &lt;fst/vector-fst.h&gt;
#include "fstrain/core/expectation-arc.h"

// Load an existing FST with features ("example.fst" is a placeholder):
fst::MutableFst&lt;fst::ExpectationArc&gt;* f1 =
    fst::VectorFst&lt;fst::ExpectationArc&gt;::Read("example.fst");

// Build a one-state FST with a single self-loop arc:
fst::VectorFst&lt;fst::ExpectationArc&gt; f2;
typedef fst::ExpectationArc::StateId StateId;
StateId s = f2.AddState();
f2.SetStart(s);
f2.SetFinal(s, fst::ExpectationWeight::One());
fst::ExpectationWeight w(-std::log(0.8));
w.GetExpectations().insert(0, -std::log(0.8)); // feature 0 fires once
w.GetExpectations().insert(1, -std::log(0.8)); // feature 1 fires once
const int ilabel = 1, olabel = 2;              // example arc labels
f2.AddArc(s, fst::ExpectationArc(ilabel, olabel, w, s));
</pre>
The standard text file format for reading and writing FSTs is used,
with an added feature vector in each arc weight. Example:
<pre>
0 1 ilabel olabel 0.2231436[0=0.2231436,1=0.2231436]
...
</pre>
where the feature vector contains key-value
pairs <code>[key1=val1,...,keyn=valn]</code>, one for each feature that
fires on the arc.
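<p>Reading the example line above: 0.2231436 &asymp; &minus;log 0.8,
i.e. the arc weight is stored in negative log space, and each feature
listed in brackets fires once on the arc, its expectation likewise
stored in negative log space (compare the C++ snippet above, where the
same weight was constructed).</p>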
<p>You can run all FST operations on an FST with features, such as
composition, determinization, minimization, etc. The semiring
operations will keep track of feature expectations in the resulting
machines. For details about the semiring, see Eisner (2004).</p>
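<p>Continuing the snippet above, here is a minimal sketch of composing
the two machines (it assumes <code>f1</code>'s output labels are
compatible with <code>f2</code>'s input labels):</p>
<pre>
#include &lt;fst/compose.h&gt;

// Compose the expectation-semiring FSTs f1 and f2 from above.  The
// semiring's Times operation combines the feature expectations, so
// each arc weight in 'composed' carries the features of both machines.
fst::VectorFst&lt;fst::ExpectationArc&gt; composed;
fst::Compose(*f1, f2, &amp;composed);
</pre>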
<p>To train the weights of a given FST, see
<code>fstrain/train/obj-func-fst-factory.h</code>, which contains a
factory for objective functions; each objective function takes an FST
and uses it to compute the function value and gradients for a given
training corpus. The command-line tools, described in the next
section, make use of that code and are sufficient for some standard
scenarios.</p>
<h2>Command-line tools</h2>
Fstrain contains command-line tools for training a model from data
and applying a trained model to new data.
<h3>Training</h3>
The script <code>scripts/train.R</code> is used for training. It
calls code to construct an FST topology (unless the user provides
one) and to train the weights from data.
<h4>Scenario 1: Train weighted Levenshtein distance</h4>
<pre>
./scripts/train.R \
--isymbols=test/create+train+decode/letters/syms \
--osymbols=test/create+train+decode/letters/syms \
--train-data=test/create+train+decode/letters/train \
--ngram-order=1 \
--force-convergence --lenpenalty-features \
--output=test/create+train+decode/letters/out
</pre>
The example training data <code>train</code> contains input-output
pairs, with each input on one line and the corresponding output on
the next:
<pre>
S b r i t t a n y _ s p a e r s E
S b r i t n e y _ s p e a r s E
S b r i n t y _ s p e a r s E
S b r i t n e y _ s p e a r s E
</pre>
<p>The <code>syms</code> file contains all symbols of the Latin
alphabet, plus the underscore used above to encode a blank, and the
start and end symbols <code>S</code> and <code>E</code>.</p>
<p>The options <code>--force-convergence --lenpenalty-features</code>
have the effect that a length penalty is increased before training
until the initial parameter values yield a convergent model, i.e. a
model on which the forward algorithm can be run without diverging.</p>
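<p>As an illustrative simplification of what the convergence check
does: if an insertion arc forms a cycle with probability-like weight
<i>p</i>, the forward algorithm sums over any number of traversals of
that cycle, contributing &sum;<sub><i>k</i>&ge;0</sub>&nbsp;<i>p</i><sup><i>k</i></sup>
= 1/(1&nbsp;&minus;&nbsp;<i>p</i>), which is finite only for
<i>p</i> &lt; 1. The length penalty scales such cycle weights down
until this condition holds.</p>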
<p>The option <code>--ngram-order</code> specifies the order of the
ngram model over alignment characters. Levenshtein distance is a
unigram model: each edit is independent of its context. Note that it
is very expensive to specify values greater than 1 without doing any
feature selection; see "Scenario 2" for a way to do that.</p>
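<p>For illustration (the notation here is simplified and is not
fstrain's internal encoding), a 1-best alignment of the first training
pair above might consist of alignment characters such as</p>
<pre>
(S:S) (b:b) (r:r) (i:i) (t:t) (t:-) (a:n) (n:e) (y:y) ...
</pre>
<p>where a pair like <code>(t:-)</code> marks a deletion; with
<code>--ngram-order=2</code>, each such pair is conditioned on the
previous one.</p>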
<p>The training will write three output files
in <code>test/create+train+decode/letters</code>:
<ul>
<li><code>out.feat-names</code>
<li><code>out.feat-weights</code>
<li><code>out.fst</code>
</ul></p>
<p><code>out.fst</code> is the trained model FST; the other two files
contain the feature names and their trained weights. (They can be
used to initialize the weights when training subsequent models; see
Scenario 2 below.)</p>
<h4>Scenario 2: Train a string transduction model with wider
context windows</h4>
If ngram orders higher than unigram are used, the space of possible
alignment ngrams needs to be reduced. The current default way to do
this is to align the input and output sides of each training example
using a cheap alignment model and to extract the alignment characters
used in the 1-best alignment. Then, using this pruned alignment
alphabet, all alignment ngrams (up to the specified ngram length) are
extracted from all alignments that are valid under the reduced
alphabet (see <code>fstrain/create/count-ngrams-in-data.h</code>).
Using the <code>train.R</code> script, this is done with the
following command:
<pre>
./scripts/train.R \
--isymbols=test/create+train+decode/letters/syms \
--osymbols=test/create+train+decode/letters/syms \
--train-data=test/create+train+decode/letters/train \
--ngram-order=2 \
--force-convergence --lenpenalty-features \
--align-fst=test/create+train+decode/letters/out.fst \
--init-weights=test/create+train+decode/letters/out.feat-weights \
--init-names=test/create+train+decode/letters/out.feat-names \
--output=test/create+train+decode/letters/out2
</pre>
The cheap alignment model is specified as a transducer; here we use
the simple unigram transducer trained in Scenario 1
above: <code>--align-fst=test/create+train+decode/letters/out.fst</code>. We
also initialize the feature weights from that earlier training run
(see <code>--init-weights</code> and <code>--init-names</code>).
<h4>Scenario 3: Train simple CRF tagger</h4>
Fstrain can also be used to train simple taggers, e.g. a POS tagger
or a chunker that assigns an <code>I</code>, <code>O</code>
or <code>B</code> tag to each input symbol. This is a much simpler
case than the above two scenarios because there are no insertions
and deletions and therefore no ambiguous alignments; every input
symbol is deterministically aligned with its output symbol.
The following command trains a chunking model:
<pre>
./scripts/train.R \
--isymbols=test/create+train+decode/chunker/isyms \
--osymbols=test/create+train+decode/chunker/osyms \
--train-data=test/create+train+decode/chunker/train \
--ngram-order=2 \
--align-fst=simple \
--output=test/create+train+decode/chunker/out
</pre>
Here we also prune the space of allowed ngrams, but we do not need
to specify an actual FST with <code>--align-fst</code>; instead we
use the reserved word <code>simple</code>, which specifies that the
alignment between input and output symbols is simply one-to-one.
The training data has the same format as before, but here, word/POS
combinations are used in the input and tags in the output:
<pre>
$ head test/create+train+decode/chunker/train
S confidence/NN in/IN the/DT pound/NN is/VBZ widely/RB expected/VBN to/TO take/VB another/DT sharp/JJ UNK if/IN trade/NN figures/NNS ...
S B_NP B_PP B_NP I_NP B_VP I_VP I_VP I_VP I_VP B_NP I_NP I_NP B_SBAR B_NP I_NP ...
</pre>
There is an additional command-line option <code>--features</code>,
which is intended to specify the feature set used in the FST
topology. However, it is currently not implemented; without hacking
the source, the only features used are complete ngram windows (of
various sizes up to the specified maximum ngram length), such as the
bigram feature "<code>the/DT|B_NP
necessary/JJ|I_NP</code>". Stay tuned for upcoming versions of
fstrain! (If you want to specify your own features in the current
version, take a look
at <code>fstrain/create/trie-insert-features.h</code>.)
<h4>Scenario 4: Train the weights of any FST</h4>
You can also create the model FST yourself, specifying all
transitions that are allowed under your model, together with the
features that fire. In that case, you can simply train its feature
weights with the following command:
<pre>
./scripts/train.R \
--isymbols=<i>input-symbols</i> \
--osymbols=<i>output-symbols</i> \
--train-data=<i>train</i> \
--fst=<i>fst</i> \
--output=<i>output-stem</i>
</pre>
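<p>As a minimal sketch (labels, feature indices, and the file name
below are illustrative placeholders), such a model FST could be built
with the expectation-semiring API shown in the C++ library section
and written to disk for use with <code>--fst</code>:</p>
<pre>
#include &lt;cmath&gt;
#include &lt;fst/vector-fst.h&gt;
#include "fstrain/core/expectation-arc.h"

// Build a one-state model whose single self-loop arc fires feature 0.
fst::VectorFst&lt;fst::ExpectationArc&gt; model;
fst::ExpectationArc::StateId s = model.AddState();
model.SetStart(s);
model.SetFinal(s, fst::ExpectationWeight::One());
fst::ExpectationWeight w(-std::log(0.5));         // initial arc weight
w.GetExpectations().insert(0, -std::log(0.5));    // feature 0 fires once
model.AddArc(s, fst::ExpectationArc(1, 1, w, s)); // placeholder labels
model.Write("my-model.fst");                      // pass via --fst=my-model.fst
</pre>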
<h3>Decoding</h3>
A simple decoder called <code>transducer-decode</code> is
available. It can be called as follows, where the FST is the one
trained in Scenario 3 above:
<pre>
./build/transducer-decode \
--isymbols=test/create+train+decode/chunker/isyms \
--osymbols=test/create+train+decode/chunker/osyms \
--fst=test/create+train+decode/chunker/out.fst \
test/create+train+decode/chunker/test
</pre>
Please keep in mind that this software is in beta.
<div align="right">Markus Dreyer, 2010-2013</div>
</body>