forked from AdamMeyers/The_Termolator
The Termolator Program 0.1 is licensed under the Apache license 2.0
(http://www.apache.org/licenses/LICENSE-2.0). It was created by Adam
Meyers, Yifan He, Zachary Glass and Shasha Liao and released in July,
2015.
The Termolator Program 0.2 (including 0.21, 0.22, 0.23, etc.) is also
licensed under the Apache license 2.0
(http://www.apache.org/licenses/LICENSE-2.0). It is a revision of
the original program. It was revised by Adam Meyers, John Ortega, Vlad
Tyshkevich, Yuling Gu and Leizhen Shi. The main changes are: (i) bug
fixes; (ii) the new version uses Python 3 exclusively -- the Python 2
portions of the original code have all been changed; (iii) additional
features have been added so the program will work better with legal
text; (iv) changes to the abbreviation program intended to improve
precision; (v) the capability to store web search based scores and
look them up on subsequent runs rather than recalculating them;
(vi) a caching feature has been added to make runs with different
foregrounds, but the same background file, more efficient; (vii) the
tchunk component of the previous system has been replaced (later
versions) with a substring component. The main effect of this is that
the substrings of noun chunks that are tested by the system are
linguistically motivated -- they must be legitimate noun groups
themselves and they must be observed in other contexts; (viii) an n-gram
filter (later versions) has been added so that unusually formatted
text (bibliographies, tables, charts, etc.) is likely to be ignored
as input; (ix) a version of the Chinese Termolator is incorporated
that is more closely connected to the English version (the import of
this feature is not complete as of this writing).
The Termolator takes two sets of documents as input, a FOREGROUND set
and a BACKGROUND set, and finds instances of terminology that are more
characteristic of the FOREGROUND than the BACKGROUND. Input files can
be .txt, .html or .xml (the latter only works if it uses HTML-style
markup to delimit text). UTF-8 encoding (which includes ASCII)
is preferred, but ISO-8859-1 will work as well.
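The encoding preference above can be mirrored when preparing or inspecting input files. The following is a minimal Python 3 sketch and is not part of Termolator; the helper name read_input_text is hypothetical. It tries UTF-8 first and falls back to ISO-8859-1.

```python
from pathlib import Path


def read_input_text(path):
    """Read a file as UTF-8, falling back to ISO-8859-1 (hypothetical helper)."""
    raw = Path(path).read_bytes()
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this decode cannot fail.
        return raw.decode("iso-8859-1")
```

Because ISO-8859-1 accepts any byte sequence, the fallback never raises, but it can silently mis-decode files that are in some third encoding.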
The biggest difference in running the system as of May 2019 is the
need for an extra parameter (parameter number 14), described below.
The essential details of the June 2018 system are described in the
following paper:
A. Meyers, Y. He, Z. Glass, J. Ortega, S. Liao,
A. Grieve-Smith, R. Grishman and O. Babko-Malaya (2018). The
Termolator: Terminology Recognition Based on Chunking,
Statistical and Search-Based Scores. Research Metrics and
Analytics (RMA).
This can be downloaded at:
https://www.frontiersin.org/articles/10.3389/frma.2018.00019/full
We will continue to modify the system over time.
Dependencies: If you have not already done so, you should install the
Python 3 version of the NLTK (see http://www.nltk.org/install.html).
To see if it is already installed, attempt to "import nltk" in
Python 3. If you get an error, then install it.
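The check described above can also be scripted. This is a small sketch using only the Python 3 standard library; the function name module_available is our own and does not come from Termolator.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    if module_available("nltk"):
        print("NLTK is installed for this Python 3 interpreter.")
    else:
        print("NLTK not found; see http://www.nltk.org/install.html")
```

Run it with the same python3 interpreter that the Termolator scripts will use, since NLTK may be installed for one interpreter but not another.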
Instructions for using the program:
1) We will assume that $TERMOLATOR is the path containing the
TERMOLATOR program. Setting an environment variable (called
TERMOLATOR) is suggested.
2) To run the main system, the command is as follows (an additional
way of running the system will be described in section 7):
$TERMOLATOR/run_termolator.sh FOREGROUND BACKGROUND EXTENSION OUTPUT_NAME TRUE-OR-FALSE TRUE-OR-FALSE 30000 5000 PROGRAM-DIRECTORY ADDITIONAL_TOPIC_STRING TRUE-OR-FALSE general_file_name_or_FALSE SHARED_BACKGROUND_FILENAME.pkl MINIMUM_PROBABILITY_OR_FALSE
The arguments are defined as follows:
Argument 1 (FOREGROUND) = a file listing the documents in the foreground set
Argument 2 (BACKGROUND) = a file listing the documents in the background set
Argument 3 (EXTENSION) = the extension of input files
Argument 4 (OUTPUT_NAME) = name of output file (without extension)
Argument 5 (TRUE-OR-FALSE) = True or False (do background files need to be processed?)
If False, stored background cache will be loaded (see Argument 13).
If True, background information will be stored as a cache file (see Argument 13).
Argument 6 (TRUE-OR-FALSE) = True if you want the system to use the relevance
score for determining rankings and False otherwise.
Argument 7 Maximum Number of Terms Considered (suggested = 30000)
Argument 8 Top N -- number of terms you want to keep in the end (suggested 5000-10000)
Argument 9 (PROGRAM-DIRECTORY) = the directory where the program is, e.g.,
$TERMOLATOR if you set this variable.
Argument 10 (ADDITIONAL_TOPIC_STRING) = topics connected with a plus sign, e.g., legal+finance.
These topics are split by plus signs. The resulting topics correspond to key words
in the dictionary_table variable in term_utilities.py. Currently, only the "legal"
topic is supported. If there are no additional topics, this variable should have
"false" as a value. If you add the legal topic, a dictionary of legal terms will
be downloaded and some specialized rules will be invoked for abbreviations. Other
topic specific features may be added in the future.
Argument 11 If True, skip preprocessing for Foreground. This comes in useful if you want to
run the same Foreground with different backgrounds or if for any reason, you have
already preprocessed the foreground file. So usually, this field should just
contain "False".
Argument 12 The name of some of the shared cached data to be used in multiple runs. This is currently
being used as a prefix for both webscore files and for lemma dictionary files, e.g., if
argument 12 is 'biology', the files will either be created or updated when the program
is run. If the value of argument 12 is False, Argument 4 will be used instead, i.e., the files
$4_lemma.dict and $4.webscore will be used. Of course a webscore file will only be generated
in either case if Argument 6 is True. These cached files make it unnecessary to recalculate
webscores for terms that have previously been looked up (saving as much as .75 seconds per
term). It also allows different forms of a lemma to be saved over a larger number of documents
(both foreground and background) so that more forms of a lemma are likely to be found.
Argument 13 The name of the background cache file associated with Argument 5. If you list False as the background
file, a default name (ranking.pkl) will be used (info will be saved to or loaded from this file). It
is suggested that you use the .pkl file type, since this is a Python pickle file.
Argument 14 Is either a number or one of the strings "patent", "normal" or "False". This argument is used by the language
model to cause the program to eliminate some blocks of text. The intention is that "abnormal" text
such as bibliographies, tables or charts will be ignored because terms extracted from such text are
likely to be of low quality. "False" causes this component not to do anything. The number should be
between 0 and -2. We are currently using -.2 (negative 2/10) for patent text and -1 for
other texts. You can also use the strings "patent" and "normal" in place of these 2 values. The number is
a count of standard deviations relative to the mean probability: a block of text is ignored if the language
model assigns it a probability lower than the mean plus that (negative) number of standard deviations,
whereas higher probability text is processed. Patents tend
to have more tables and in-text bibliographies than "normal" text and therefore we assume a higher
threshold.
IMPORTANT PATH INFORMATION: If FOREGROUND and BACKGROUND files contain absolute paths, this command will work from anywhere. Otherwise, you should run from the directory containing FOREGROUND and BACKGROUND. We will call this the DATA_DIRECTORY. The files listed in FOREGROUND and BACKGROUND should be paths relative to the DATA_DIRECTORY.
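Since the order of the 14 positional arguments is easy to get wrong, it may help to assemble the command from named values. The following Python 3 sketch is not part of Termolator; build_termolator_command and its keyword names are hypothetical, and only the argument order and the True/False conventions are taken from the description above.

```python
def build_termolator_command(program_dir, foreground, background, extension,
                             output_name, process_background=True,
                             use_relevance=False, max_terms=30000, top_n=5000,
                             topics=None, skip_foreground_preprocessing=False,
                             shared_name=None, background_cache=None,
                             min_probability=None):
    """Return the run_termolator.sh command as a list of strings."""

    def flag(value):
        # Arguments 5, 6 and 11 are written as True or False.
        return "True" if value else "False"

    def or_false(value):
        # Arguments 12, 13 and 14 take False when no value is given.
        return "False" if value is None else str(value)

    return [
        f"{program_dir}/run_termolator.sh",
        foreground,                                # Argument 1
        background,                                # Argument 2
        extension,                                 # Argument 3
        output_name,                               # Argument 4
        flag(process_background),                  # Argument 5
        flag(use_relevance),                       # Argument 6
        str(max_terms),                            # Argument 7
        str(top_n),                                # Argument 8
        program_dir,                               # Argument 9
        "+".join(topics) if topics else "False",   # Argument 10
        flag(skip_foreground_preprocessing),       # Argument 11
        or_false(shared_name),                     # Argument 12
        or_false(background_cache),                # Argument 13
        or_false(min_probability),                 # Argument 14
    ]


if __name__ == "__main__":
    command = build_termolator_command(
        "/path/to/termolator", "foreground.list", "background.list", ".htm",
        "knitting", use_relevance=True, background_cache="gutenberg.pkl",
        min_probability=-1)
    print(" ".join(command))
```

The resulting list can be passed to subprocess.run with cwd set to the DATA_DIRECTORY when the list files contain relative paths. The path /path/to/termolator is a placeholder.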
To test the program, we suggest going to one of our 3 test directories and running the command from there. Note that
we will shortly be adding a corpus of court decisions, for which the legal topic features are useful. We have not
yet tested whether these same features are useful for the patent directory provided here.
a) subdirectory: gutenberg-test
command: $TERMOLATOR/run_termolator.sh foreground.list background.list .htm knitting True True 30000 5000 $TERMOLATOR False False False gutenberg.pkl -1
-- The "True" setting in field 6 will make this run take an extra 10 minutes to run about 600
web searches, but the results are more accurate as a result.
-- After this run, the background statistics will be stored in gutenberg.pkl. Thus, for a
second run, possibly with a different foreground, setting the fifth parameter to False
will make run time be somewhat faster, especially if the sixth parameter is set to False and
the websearch score is not used.
-- These texts are taken from the English portion of Project Gutenberg, a repository of public
domain texts. The FOREGROUND is a set of chapters of "The Project Gutenberg eBook of Encyclopedia
of Needlework", by Therese De Dillmont. The BACKGROUND is a random selection of documents from the
same repository. Thus the resulting terminology list consists of terms in the domain of "knitting".
For more information on Project Gutenberg, go to: https://www.gutenberg.org/
-- The threshold probability is set to 1 standard deviation below the mean probability (Argument 14 = -1).
b) subdirectory: OANC-test
command: $TERMOLATOR/run_termolator.sh foreground.list background.list .txt biology True False 30000 5000 $TERMOLATOR False False False OANC.pkl -1
-- This run will be faster than the previous run per term generated. If the False in field 6 is
replaced by True, the system will take an extra 3 hours (about 1 second for each of 10,000 terms),
but the results will be better.
-- These texts are taken from the Open American National Corpus (OANC), a project devoted to collecting
freely available text for processing by computational linguistics. The FOREGROUND consists of 100
biology related documents and the background consists of 100 random documents. The resulting terms
are all about biology. For more information about the OANC, go to: http://www.anc.org/data/oanc/
-- The threshold probability is set to 1 standard deviation below the mean probability (Argument 14 = -1).
c) subdirectory: patent-test
command: $TERMOLATOR/run_termolator.sh foreground.list background.list .XML surgery True False 30000 5000 $TERMOLATOR False False False patent.pk -.2
-- This run should generate about 4700 terms. If False in field 6 is changed to True, it will take an additional
1.5 hours, but achieve better results.
-- These documents are taken from Google patents. We downloaded files and divided them by the US patent
classes encoded in the documents. The foreground is a set of patents in class 606 (see the
main_classification field in the XML), which are all about "Surgery". The background is a set of
randomly selected patents. The resulting terms are all in the domain of "Surgery".
-- The threshold probability is set to .2 standard deviations below the mean probability (Argument 14 = -.2).
Blocks of text are likely to be ignored if they are too different from the norm (based on the OANC text).
3) Choice of foreground and background documents. Different choices will affect the sort of terminology the system
will recognize. We recommend between 500 and 5000 small files for both foreground and background or fewer large
files. The example documents should give you some idea of what is possible. As the examples show, different numbers
of files are possible.
We suggest creating directories which contain only the input (foreground and background) files and in which none of the
files have any of the following file extensions, as these may be overwritten by the system: .abbr .fact .pos .substring .terms .txt2 .txt3
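Before a run, an input directory can be checked for files whose extensions the system may overwrite. This is a minimal Python 3 sketch, not part of Termolator; the function name find_conflicting_files is hypothetical, and the extension list is copied from the paragraph above.

```python
from pathlib import Path

# Extensions that the system may overwrite, as listed above.
RESERVED_EXTENSIONS = {".abbr", ".fact", ".pos", ".substring", ".terms",
                       ".txt2", ".txt3"}


def find_conflicting_files(directory):
    """Return the sorted names of files in directory with a reserved extension."""
    return sorted(
        path.name
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix in RESERVED_EXTENSIONS
    )


if __name__ == "__main__":
    for name in find_conflicting_files("."):
        print("may be overwritten:", name)
```

The check only looks at the top level of the directory. If your input files are spread over subdirectories, Path.rglob("*") could be used in place of iterdir().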
4) The output produced by the program includes the following, based on the OUTPUT_NAME:
OUTPUT_NAME.out_term_list --- This is the final output, a list of the top N terms in order of rank,
where N = Argument 8 above. Each line consists of the term lemma
followed by variants of that lemma, separated by tabs. Consider the
following 2 lines from the sample OANC-test biology.out_term_list file:
glucocorticoid receptor\tglucocorticoid receptors\tgr
p53 activation\tactivation of p53
As these examples show, the lemma appears first. Then alternative forms
appear, including plurals, abbreviations and/or noun phrases with
prepositional phrase right modifiers (we assume that the left-modified
noun-noun compounds are the lemmas).
OUTPUT_NAME.scored_output --- This is a superset of the previous list, approximately the top 30% of the terms
considered by the system or the top K terms, if K is lower than 30%, and
where K = Argument 7 above. If Argument 6 is set to True, a .8 seconds-long
web search is used to score each of the elements in this list. Thus 30,000 terms
will take around 24,000 seconds or just under 7 hours.
--- There are several columns on each line, divided by tabs, as follows:
Column 1 -- the term (just the lemma)
Column 2 -- a rule classification (used to determine if the term is well-formed)
Column 3 -- a value Good, Medium, Neutral (measuring the quality of the term by some set of rules)
Column 4 -- a score between 0 and 1 measuring term quality
Column 5 -- a score between 0 and 1 measuring the term rank
Column 6 -- column 4 X column 5
Column 7 -- the relevance score (if being used)
Column 8 -- column 4 X column 5 X column 7 (if relevance score is used)
OUTPUT_NAME.dict_abbr_to_full -- This is a dictionary taken from the Foreground that maps abbreviations to their
full forms, e.g., HTML --> Hypertext Markup Language. It is used for determining
if a sequence of words is a valid term.
OUTPUT_NAME.dict_full_to_abbr -- This is a dictionary going in the opposite direction, from full forms to abbreviations, e.g.,
Hypertext Markup Language --> HTML
OUTPUT_NAME.rejected-terms -- These are terms generated by the system, but rejected either because they are deemed to be
ill-formed or because they were not ranked sufficiently highly by the initial steps to be
considered (top 30% or top 30K if argument 7 is set to 30,000).
-- There are several columns, similar to OUTPUT_NAME.scored_output
Column 1 -- the term
Column 2 -- "FILTERED_OUT" if removed due to a well-formedness rule or "BEYOND-CUTOFF", if
removed because the term was ranked lower than 30% or 30K as discussed above.
Column 3 -- a rule classification (used to determine if the term is well-formed)
Column 4 -- a value Good, Medium, Neutral (measuring the quality of the term by some set of rules)
Column 5 -- a score between 0 and 1 measuring term quality
Column 6 -- a score between 0 and 1 measuring the term rank
Column 7 -- column 5 X column 6
OUTPUT_NAME.all_terms -- This is the intermediate list of terms that is generated before 30% or 30K terms are
rejected. The scores are based on a distributional component of our system.
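The tab-separated output files described above are straightforward to read back in. The following Python 3 sketch is not part of Termolator; the function names and dictionary keys are our own, and it assumes that columns 4 through 8 of the .scored_output file are written as plain decimal numbers, which we have not verified against an actual output file.

```python
def parse_term_list_line(line):
    """Split one .out_term_list line into (lemma, list of variants)."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], fields[1:]


def parse_scored_output_line(line):
    """Split one .scored_output line into a dictionary keyed by column role."""
    fields = line.rstrip("\n").split("\t")
    record = {
        "term": fields[0],                 # Column 1
        "rule": fields[1],                 # Column 2
        "rating": fields[2],               # Column 3
        "quality": float(fields[3]),       # Column 4
        "rank_score": float(fields[4]),    # Column 5
        "combined": float(fields[5]),      # Column 6 = column 4 X column 5
    }
    if len(fields) >= 8:
        # Columns 7 and 8 are only meaningful when the relevance score is used.
        record["relevance"] = float(fields[6])
        record["combined_with_relevance"] = float(fields[7])
    return record
```

For example, parse_term_list_line applied to the first sample line above yields the lemma "glucocorticoid receptor" with the variants "glucocorticoid receptors" and "gr".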
5) For each file processed, the following intermediate files are generated:
FILE.fact -- this identifies where blocks of text start and end (e.g., paragraphs in html)
FILE.txt3 -- the .fact file is pointing to start and end character offsets in this file
FILE.pos -- this provides part of speech tags for each token and is used in processing.
the start and end numbers point to character offsets in the .txt3 file
FILE.terms -- this identifies terms inline, by means of our chunking program. The start
and end positions point to character offsets in the .txt3 file.
FILE.abbr -- this file identifies relations between abbreviations and full forms in text
FILE.substring -- this file includes both terms from FILE.terms and well-formed
substrings of those words used by the distributional system.
Note that in earlier versions of Termolator, .tchunk and
.tchunk.nps files were used for a similar purpose.
6) In addition the following files are created for purposes of speeding up processing in multiple runs:
$13 -- a pickled file storing the background component of the distributional score
$12_lemma.dict -- a lemma dictionary, used to generate the non-initial columns in
the .out_term_list files
$12.webscore -- A file saving webscores for terms
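The naming rules for these cache files, as described for Arguments 4, 12 and 13, can be summarized in a few lines. This Python 3 sketch is only an illustration and is not taken from the Termolator source; the function name cache_file_names is hypothetical.

```python
def cache_file_names(output_name, shared_name="False", background_cache="False"):
    """Return the cache file names implied by Arguments 4, 12 and 13.

    The string "False" stands for an argument that was not supplied,
    matching the convention of the shell script.
    """
    # Argument 12 falls back to Argument 4 when it is False.
    prefix = output_name if shared_name == "False" else shared_name
    # Argument 13 falls back to the default name ranking.pkl.
    pickle_name = "ranking.pkl" if background_cache == "False" else background_cache
    return {
        "background_cache": pickle_name,
        "lemma_dict": f"{prefix}_lemma.dict",
        "webscore": f"{prefix}.webscore",
    }
```

The .webscore file is only produced when Argument 6 is True, so the name returned here may refer to a file that does not exist after a run.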
7) In addition to the "main" way of running the system, there are three additional options.
A. Single File as foreground. It is possible to run using a single
file as foreground. We are currently generating one set of terms
using each supreme court decision as foreground and the full set of
supreme court decisions as background. In future versions, we will
provide an example from this run. To do this, we use the following
script with the following arguments:
run_termolator_with_1_file_foreground.sh FOREGROUND BACKGROUND EXTENSION OUTPUT_NAME TRUE-OR-FALSE TRUE-OR-FALSE 10000 1000 PROGRAM-DIRECTORY ADDITIONAL_TOPIC_STRING TRUE_OR_FALSE general_file_name_or_FALSE SHARED_BACKGROUND_FILENAME.pkl MINIMUM_PROBABILITY_OR_FALSE
This takes all the same arguments as run_termolator.sh, with the following exceptions:
Argument 1 (FOREGROUND) is one foreground file, rather than a file
containing a list of files. The filetype is left out
Arguments 7 and 8 -- lower numbers are recommended for single
files. Initially we assume 10000 and 1000, but these numbers are
probably too high
For the first run, Arguments 5 and 11 should both be True
if the foreground file is part of the background. If not, then
Argument 11 should be False. For subsequent runs Argument 5 should
be False (assuming the same background is being used).
B. Phase 1 Only -- Suppose you do not want to run the full
Termolator system. Suppose you are only interested in identifying
the technical noun groups that we use as input to the
distributional system.
The script run_terms_only.sh takes only five arguments, a subset of the arguments of the full
run_termolator.sh script.
Argument 1: input files -- a file listing the input txt, html or xml files
Argument 2: file type of input files
Argument 3: output_file_name (mainly used for creating dictionary/caching files)
Argument 4: directory of termolator (like run_termolator.sh)
Argument 5: name of special topic area (like run_termolator.sh)
The program creates some of the same preprocessing files that run_termolator.sh creates, but does
not continue to produce a term list. I suspect that the most useful output files are the .terms
and .abbr files. The .terms files are the files listing the technical noun groups found in each
of the input files and the .abbr files are the list of abbreviation relations found.
C. There is a more experimental option. This has only been partially implemented and not completely
tested. Suppose you want to run Phase 1, but you find that there is some other type of constituent,
e.g., named entities, that you can detect by some other means. Furthermore, suppose that Termolator is
making errors wherever there are named entities and you would like to eliminate terms that "conflict"
with these NEs. There is an additional python file that will remove any such conflicting term from the "terms" files.
We have not tested this much yet, so it is not currently used in any of the shell scripts, and you
may have to customize a shell script if you want to incorporate this in a run (either with
run_termolator.sh or with run_terms_only.sh).
In find_terms.py, there is a function find_inline_terms_for_file_list, which is called by
run_find_inline_terms.py (by the shell scripts). That function takes an optional keyword argument
"ne_filter_ending". If you set that ending to a file type, e.g., ".ne" or the like, the program
will look for files of that type. Those files will be assumed to include lines of XML with "start"
and "end" values, e.g.,
<citation id="108713_1" entry_type="standard_case" start="1" end="20" reporter="U.S." standard_reporter="U.S." volume="410" page_number="113" year="1973" line="2">410 U.S. 113 (1973)</citation>
The values other than start and end do not matter. The terms collected will be constrained so that
they do not include any strings between instances of "start" and "end" found in that file.
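To make the idea of the filter concrete, the following Python 3 sketch reads "start" and "end" values from XML lines like the one above and drops terms that overlap those spans. It is not the code in find_terms.py. The function names, the (text, start, end) tuple used for terms, and the treatment of spans as half-open character ranges are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def read_spans(xml_lines):
    """Collect (start, end) offsets from lines that each hold one XML element."""
    spans = []
    for line in xml_lines:
        line = line.strip()
        if not line:
            continue
        element = ET.fromstring(line)
        spans.append((int(element.get("start")), int(element.get("end"))))
    return spans


def overlaps(span_a, span_b):
    """Return True if two half-open spans share at least one character."""
    return span_a[0] < span_b[1] and span_b[0] < span_a[1]


def filter_terms(terms, blocked_spans):
    """Keep only terms (text, start, end) that overlap no blocked span."""
    return [
        term for term in terms
        if not any(overlaps((term[1], term[2]), span) for span in blocked_spans)
    ]
```

With the citation line above, read_spans returns the span from 1 to 20, so a hypothetical term at offsets 5 to 9 would be dropped while a term at offsets 40 to 56 would be kept.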
8) Runtime
There are the following factors to consider.
First of all, increasing the number of foreground and background files
can slow down the system, especially above 5000 files. We have
so far not found it to be noticeably advantageous to exceed 5000
files. Good results can also be obtained with fewer files, but we have
not identified an ideal amount of data to use -- we suspect that may
depend on a wide range of factors including how different your
foreground and background are.
Secondly, the preprocessing and distributional stages can take longer
with larger file sizes. If you use the same background files for
different sets of terms, you can save processing time, by not
processing these files more than once. The first time you run with a
set of background files, you set Argument 5 to True (as in all the
examples above). However, for subsequent uses of these background
files, you can set Argument 5 to False (provided your list of
background files points to files that have been processed). We have
found that preprocessing proceeds at about 120 megabytes per hour on our
current PCs (2.53 GHz Xeon), or 3 hours for 360 megabytes (5000
patents). Assuming that the foreground and background corpora are
approximately the same size, we estimate that preprocessing both
corpora will take about 6 hours. The distributional
processing will take about 11 hours for files of this size. Note that
the distributional processing will be shorter (we have not measured
this yet) if you are running on a previously processed background
corpus and use the associated cache (.pkl) file.
Thirdly, as we have already demonstrated, use of the Relevance score
as determined by Arguments 6 and 7 affects total processing time. Each
web search takes about .8 seconds and there is one web search per
term, so 30K terms can take about 7 hours. For this reason, we
currently set a hard limit on the total number of terms in the output
to the minimum of: a) 30% of the terms produced by the previous stage
(these tend to be higher quality) and b) 30K terms (we assume it is
not worth it to wait longer than this for the results). When run on
5000 foreground and background files, the system typically ends up
generating relevance scores for 20-30K terms. Also the system will
create a dictionary of Relevance Scores for terms. You can keep one
such dictionary for all runs of termolator and send it to Termolator
via Argument 12. Each run will update this dictionary of Relevance
scores and prevent terms from being web-searched more than
once. Alternatively, if you set Argument 12 to False, variable $4 will
be used to create a dictionary of Relevance scores called
$4.outputweb.score. Either way, re-using relevance scores can result
in a significant speed-up.
In summary, the example described here would take about 24 hours to
run. However, the following factors could lead to substantially
shorter runtimes: (a) reusing preprocessed background data; (b) using
fewer than 5000 files; (c) opting for fewer than 30K terms for using
the websearch metric (see below); and (d) reusing relevance scores via
a dictionary.
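The 24 hour figure follows from the numbers given in this section: about 6 hours of preprocessing, about 11 hours of distributional processing and about 7 hours of web searches. The Python 3 sketch below reproduces that arithmetic; the function name and parameter names are our own. The 11 hours for the distributional stage is the measured figure quoted above for corpora of this size and is not scaled by the sketch.

```python
def estimate_hours(corpus_megabytes=360, num_corpora=2, megabytes_per_hour=120,
                   distributional_hours=11, num_terms=30000,
                   seconds_per_search=0.8):
    """Estimate run time in hours from the figures quoted in this section."""
    # Preprocessing: foreground and background at about 120 megabytes per hour.
    preprocessing = num_corpora * corpus_megabytes / megabytes_per_hour
    # Relevance score: one web search per term.
    web_search = num_terms * seconds_per_search / 3600
    return {
        "preprocessing": preprocessing,
        "distributional": distributional_hours,
        "web_search": web_search,
        "total": preprocessing + distributional_hours + web_search,
    }


if __name__ == "__main__":
    for stage, hours in estimate_hours().items():
        print(f"{stage}: {hours:.1f} hours")
```

Setting num_terms to 0 models a run with Argument 6 set to False, and setting num_corpora to 1 models a run that reuses a previously preprocessed background.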
9) Known Issues relating to the Relevance Score (arguments 6 and 7)
A) Internet outages can cause the system to fail.
B) The score is based on free Yahoo! web searches. If Yahoo! changes
the way search results are printed out, this can result in the
system not working. In particular, the function "get_top_ten" in
webscore.py may need to be changed to reflect such future changes.
C) If extensive use is made of the system, Yahoo! may ban you from
using their search engine for a time. We have used 40K or so
searches at a time and not had this happen. However, if it did,
a different way of accessing the search engine would be
necessary and the setting up of a Yahoo account would be
needed. Last time we checked, it would cost approximately .08
cents per search through https://yboss.yahooapis.com/ysearch/web
and further adjustments to the code would be required to make
this work.
10) Known Issues with the POS tagger. We currently assume a maximum
file size of 500000 bytes. This can be changed by editing the
TERMOLATOR_POS.properties file. However, our current POS tagger
uses lots of memory for large files. So it is not advisable to
raise this amount by a lot. An alternative is to shorten or split
very large files when using them with this system.
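One way to split a very large file is to cut it at line boundaries so that each piece stays within the size limit. The following Python 3 sketch is not part of Termolator; the function names are hypothetical, the 500000 byte default is taken from the paragraph above, and we assume the limit applies to the UTF-8 encoded size of each piece.

```python
MAX_BYTES = 500000  # default limit mentioned above


def _pieces(line, max_bytes):
    """Yield the line whole if it fits, otherwise in character slices that fit."""
    if len(line.encode("utf-8")) <= max_bytes:
        yield line
        return
    step = max_bytes // 4  # a UTF-8 character takes at most 4 bytes
    for start in range(0, len(line), step):
        yield line[start:start + step]


def split_text(text, max_bytes=MAX_BYTES):
    """Split text into chunks of at most max_bytes UTF-8 bytes each."""
    if max_bytes < 4:
        raise ValueError("max_bytes must be at least 4")
    chunks, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        for piece in _pieces(line, max_bytes):
            piece_size = len(piece.encode("utf-8"))
            if current and size + piece_size > max_bytes:
                # The current chunk is full, so start a new one.
                chunks.append("".join(current))
                current, size = [], 0
            current.append(piece)
            size += piece_size
    if current:
        chunks.append("".join(current))
    return chunks
```

Each chunk would then be written to its own file and listed separately in the FOREGROUND or BACKGROUND file. Note that splitting at line boundaries may separate sentences in text without regular line breaks, so splitting at paragraph or document boundaries is preferable when the markup makes that possible.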
11) We experienced the following issue with version 0.1 which used
Python 2 for nltk. We have not experienced it yet with version 0.2
(which only uses Python 3) and do not currently know whether it is
still an issue. The distributional system seems to clash with some
instances of NLTK, but not others. We suspect that there may be
some default encoding settings somewhere, which could default to
ascii, rather than utf-8 for some setups.
For problematic setups, we have gotten the following error messages
during the distributional phase of the system:
"/usr/local/lib/python2.7/dist-packages/nltk/stem/porter.py", line 289, in _step1ab
if word.endswith("ied"):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 7: ordinal not in range(128)
Once again, we have not seen this error yet with the current all-python-3 version.
12) One of the newest features is the argument 14 (MINIMUM_PROBABILITY_OR_FALSE) feature. This feature is used to cause the system to ignore text that is too different from "normal" text. Normal text is defined according to an N-gram character-based language model. A number of character-based language models were tested, but the one that worked the best so far, assumed exactly 5 distinctions between characters: LETTER, WHITESPACE, DIGIT, PUNCTUATION, OTHER and looked for the average probability of 5-grams of these characters within a block of text. This, for example, tended to rule out text with an excess of punctuation and spaces. This language model was run on the Open American National Corpus text (see http://www.anc.org/) in the file all-OANC.txt. The average 5-gram probability is calculated for each segment in that corpus. The results are averaged and the standard deviation is calculated. This information is then used to classify sections in other text.
The files gen2_lang.model and OANC.profile2 are derived in this manner and currently used to classify input to Termolator. The function "train_on_OANC" in "make_language_model.py" can be used to derive the files and "print_OANC_demo" can be used to gain some insight into how this works. Other language models which we did not end up using are also present in "make_language_model.py".
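To give a feel for the approach, here is a simplified Python 3 sketch of a character-class 5-gram model with a standard deviation threshold. It is not the code in make_language_model.py and does not read gen2_lang.model or OANC.profile2. All function names are our own, and details such as the exact class definitions, the use of unsmoothed relative frequencies and the handling of short blocks are assumptions.

```python
import math
import string
from collections import Counter


def char_class(ch):
    """Map a character to LETTER, WHITESPACE, DIGIT, PUNCTUATION or OTHER."""
    if ch.isalpha():
        return "L"
    if ch.isspace():
        return "W"
    if ch.isdigit():
        return "D"
    if ch in string.punctuation:
        return "P"
    return "O"


def class_ngrams(text, n=5):
    """Return all n-grams over the character classes of text."""
    classes = "".join(char_class(ch) for ch in text)
    return [classes[i:i + n] for i in range(len(classes) - n + 1)]


def average_probability(text, model, n=5):
    """Average the model probability of each n-gram in a block of text."""
    grams = class_ngrams(text, n)
    if not grams:
        return 0.0
    return sum(model.get(gram, 0.0) for gram in grams) / len(grams)


def train(segments, n=5):
    """Estimate n-gram probabilities, then the mean and standard deviation
    of the per-segment average probability."""
    counts = Counter()
    for segment in segments:
        counts.update(class_ngrams(segment, n))
    total = sum(counts.values())
    model = {gram: count / total for gram, count in counts.items()}
    scores = [average_probability(segment, model, n) for segment in segments]
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
    return model, mean, std


def resolve_threshold(arg14):
    """Turn the Argument 14 string into a number, or None for "False"."""
    if arg14 == "False":
        return None
    named = {"patent": -0.2, "normal": -1.0}
    return named[arg14] if arg14 in named else float(arg14)


def keep_block(text, model, mean, std, threshold):
    """Keep a block unless it falls below mean + threshold * std."""
    if threshold is None:
        return True
    return average_probability(text, model) >= mean + threshold * std
```

In this sketch a table-like block made mostly of digits, spaces and punctuation receives a low average probability from a model trained on ordinary prose, so it falls below the threshold and is skipped. A threshold closer to 0, such as the patent setting, rejects more text than -1 does.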