Release 3.2
Revision: 22 April 2014
This README file describes the NUS MaxMatch (M^2) scorer.
Copyright (C) 2013 Daniel Dahlmeier, Hwee Tou Ng, Christian Hadiwinoto,
and Raymond Hendy Susanto
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or (at
your option) any later version.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
If you are using the NUS M^2 scorer in your work, please include a
citation of the following paper:
Daniel Dahlmeier and Hwee Tou Ng. 2012. Better Evaluation for
Grammatical Error Correction. In Proceedings of the 2012 Conference of
the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies (NAACL 2012).
Any questions regarding the NUS M^2 scorer should be directed to
Hwee Tou Ng ([email protected]).
Contents
========
0. Quickstart
1. Pre-requisites
2. Using the scorer
2.1 System output format
2.2 Scorer's gold standard format
3. Converting the CoNLL-2014 data format
4. Revisions
4.1 Alternative edits
4.2 F-beta measure
4.3 Handling of insertion edits
4.4 Bug fix for scoring against multiple sets of gold edits, and
dealing with sequences of insertion/deletion edits
0. Quickstart
=============
./m2scorer [-v] SYSTEM SOURCE_GOLD
SYSTEM = the system output in sentence-per-line plain text.
SOURCE_GOLD = the source sentences with gold edits.
1. Pre-requisites
=================
The following dependencies have to be installed to use the M^2 scorer.
+ Python (>= 2.6.4, < 3.0, older versions might work but are not tested)
+ nltk (http://www.nltk.org, needed for sentence splitting)
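One common way to install the nltk dependency, assuming pip is
available (illustrative only; any other installation method works
just as well):
  pip install nltk
  python -c "import nltk; print(nltk.__version__)"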
2. Using the scorer
===================
Usage: m2scorer [OPTIONS] SYSTEM SOURCE_GOLD
where
SYSTEM - system output, one sentence per line
SOURCE_GOLD - source sentences with gold token edits
OPTIONS
-v --verbose - print verbose output
--very_verbose - print lots of verbose output
--max_unchanged_words N - Maximum unchanged words when extracting edits. Default = 2.
--ignore_whitespace_casing - Ignore edits that only affect whitespace and casing. Default no.
--beta B - Set the weight of recall relative to precision in the F-measure. Default = 0.5.
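For example, one might run the scorer with verbose output and an
explicit beta value (the file names here are purely illustrative):
  ./m2scorer -v --beta 0.5 system.txt source_gold.m2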
2.1 System output format
========================
SYSTEM = File that contains the output of the error correction
system. The sentences should be in tokenized plain text, sentence-per-line
format.
Format:
<tokenized system output for sentence 1>
<tokenized system output for sentence 2>
...
Examples of tokenization:
-------------------------
Original : He said, "We shouldn't go to the place. It'll kill one of us."
Tokenized : He said , " We should n't go to the place . It 'll kill one of us . "
Note: Tokenization in the CoNLL-2014 shared task uses the NLTK word tokenizer.
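A minimal tokenization sketch using NLTK (illustrative only; the exact
output, in particular the rendering of quotation marks, can differ
between NLTK versions):
  import nltk
  # Depending on the NLTK version, the 'punkt' tokenizer models may need
  # to be downloaded once first, e.g. via nltk.download('punkt').
  sentence = 'He said, "We shouldn\'t go to the place."'
  print(" ".join(nltk.word_tokenize(sentence)))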
Sample output:
--------------
===> system <===
A cat sat on the mat .
The Dog .
2.2 Scorer's gold standard format
=================================
SOURCE_GOLD = source sentences (i.e. input to the error correction
system) and the gold annotation in TOKEN offsets (starting from zero).
Format:
S <tokenized source sentence 1>
A <token start offset> <token end offset>|||<error type>|||<correction1>||<correction2>||..||<correctionN>|||<required>|||<comment>|||<annotator id>
A <token start offset> <token end offset>|||<error type>|||<correction1>||<correction2>||..||<correctionN>|||<required>|||<comment>|||<annotator id>
S <tokenized source sentence 2>
A <token start offset> <token end offset>|||<error type>|||<correction1>||<correction2>||..||<correctionN>|||<required>|||<comment>|||<annotator id>
Notes:
------
- Each source sentence should appear on a single line starting with "S "
- Each source sentence is followed by zero or more annotations.
- Each annotation is on a separate line starting with "A ".
- Sentences are separated by one or more empty lines.
- The source sentences need to be tokenized in the same way as the system output.
- Start and end offset for annotations are in token offsets (starting from zero).
- The gold edits can include one or more possible correction strings. Multiple corrections should be separated by '||'.
- The error type, required field, and comment are not used for scoring at the moment. You can put dummy values there.
- The annotator ID is used to identify a distinct annotation set by which system edits will be evaluated.
- Each distinct annotation set, identified by an annotator ID, is an alternative gold standard for the sentence.
- If one sentence has annotations from multiple annotator IDs, a score will be computed against each annotator's set.
- If one of the alternative annotations is to make no edit at all, an edit with type 'noop' or with offsets '-1 -1' must be specified.
- The final score for the sentence uses the edits of the annotation set that maximizes the score.
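As an illustration of this format (this is not the scorer's own
reader), a minimal Python sketch that parses SOURCE_GOLD into source
sentences and per-annotator edit lists might look as follows; error
handling is omitted:
  def parse_source_gold(path):
      # Returns a list of (source tokens, {annotator id: [(start, end, corrections)]}).
      sentences = []
      for line in open(path):
          line = line.rstrip("\n")
          if line.startswith("S "):
              edits = {}
              sentences.append((line[2:].split(), edits))
          elif line.startswith("A "):
              fields = line[2:].split("|||")
              start, end = [int(x) for x in fields[0].split()]
              corrections = fields[2].split("||")   # '||' separates alternative corrections
              annotator = int(fields[-1])
              edits.setdefault(annotator, []).append((start, end, corrections))
      return sentences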
Example:
--------
===> source_gold <===
S The cat sat at mat .
A 3 4|||Prep|||on|||REQUIRED|||-NONE-|||0
A 4 4|||ArtOrDet|||the||a|||REQUIRED|||-NONE-|||0
S The dog .
A 1 2|||NN|||dogs|||REQUIRED|||-NONE-|||0
A -1 -1|||noop|||-NONE-|||-NONE-|||-NONE-|||1
S Giant otters is an apex predator .
A 2 3|||SVA|||are|||REQUIRED|||-NONE-|||0
A 3 4|||ArtOrDet|||-NONE-|||REQUIRED|||-NONE-|||0
A 5 6|||NN|||predators|||REQUIRED|||-NONE-|||0
A 1 2|||NN|||otter|||REQUIRED|||-NONE-|||1
===> system <===
A cat sat on the mat .
The dog .
Giant otters are apex predator .
./m2scorer system source_gold
Precision : 0.8
Recall : 0.8
F_0.5 : 0.8
For sentence #1, the system makes two valid edits {(at -> on),
(\epsilon -> the)} and one unnecessary edit (The -> A).
For sentence #2, despite missing one gold edit (dog -> dogs) according
to annotation set 0, the system misses nothing according to set 1.
For sentence #3, according to annotation set 0, the system makes two
valid edits {(is -> are), (an -> \epsilon)} and misses one edit
(predator -> predators); however according to set 1, the system makes
two unnecessary edits {(is -> are), (an -> \epsilon)} and misses one
edit (otters -> otter).
In the case above, there are in total four valid edits, one unnecessary
edit, and one missing edit. Therefore precision is 4/5 = 0.8, and
similarly recall is 4/5 = 0.8. In the above example, the beta value for
the F-measure is 0.5 (the default value).
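For reference, the numbers above follow from the usual F_beta formula;
a minimal Python sketch (illustrative only, not the scorer's own code):
  def f_beta(precision, recall, beta=0.5):
      # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
      if precision == 0.0 and recall == 0.0:
          return 0.0
      b2 = beta * beta
      return (1 + b2) * precision * recall / (b2 * precision + recall)
  print(f_beta(4.0 / 5, 4.0 / 5))   # 0.8, matching the scorer output above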
3. Converting the CoNLL-2014 data format
========================================
The data format used in the M^2 scorer differs from the format used in
the CoNLL-2014 shared task (http://www.comp.nus.edu.sg/~nlp/conll14st.html)
in two aspects:
- sentence-level edits
- token edit offsets
To convert source files and gold edits from the CoNLL-2014 format into
the M^2 format, run the preprocessing script bundled with the CoNLL-2014
training data.
4. Revisions
============
4.1 Alternative edits
In this release, there is a major modification which enables scoring
with multiple sets of gold edits. For every sentence, the system
output will be scored against every available set of gold edits for
the sentence, and the set of gold edits that maximizes the F score of
the sentence is chosen.
This modification was carried out by Christian Hadiwinoto, 2013.
4.2 F-beta measure
While the previous release always used the F1 measure, i.e. beta =
1.0, this release supports any value of beta. The default value of
beta in this version is 0.5.
This modification was carried out by Raymond Hendy Susanto, 2013.
4.3 Handling of insertion edits
Multiple insertion edits (starting and ending at the same offset) that
match a gold edit were counted repeatedly, leading to erroneous and
inflated scores. A fix has been made to handle this. The order of
insertion edits, which was not handled in the previous version, is now
enforced.
This modification was jointly carried out by Raymond Hendy Susanto and
Christian Hadiwinoto, 2014.
4.4 Bug fix for scoring against multiple sets of gold edits, and
dealing with sequences of insertion/deletion edits
Fixed a bug in the M^2 scorer arising from scoring against gold edits
from multiple annotators. Specifically, the bug sometimes caused
incorrect scores to be reported when scoring against the gold edits of
subsequent annotators (other than the first annotator).
Fixed a bug in the M^2 scorer that caused erroneous scores to be
reported when dealing with insertion edits followed by deletion edits
(or vice versa).
The above modifications were carried out by Christian Hadiwinoto,
2014.