Morfessor 2.0 - Quick start
===========================
Installation
------------
Morfessor 2.0 is installed using the setuptools library for Python. To
build and install the module and scripts to the default paths, run ::

    python setup.py install

For details, see http://docs.python.org/install/
Documentation
-------------
User instructions for Morfessor 2.0 are available in the docs directory
as Sphinx source files (see http://sphinx-doc.org/). Instructions on how
to build the documentation can be found in docs/README.
The documentation is also available on-line at http://morfessor.readthedocs.org/
Morfessor EM+Prune
------------------
This branch includes the modifications to Morfessor that enable
training using Expectation Maximization and pruning.
Morfessor EM+Prune training achieves a better Morfessor cost than the
earlier local-search algorithm.
A simple usage example ::

    # Create a 1M-substring seed lexicon directly from a pretokenized corpus
    freq_substr.py --lex-size 1000000 < corpus > freq_substr.1M

    # Perform Morfessor EM+Prune training, autotuning to a 10k lexicon size
    morfessor \
        --em-prune freq_substr.1M \
        -t corpus \
        --num-morph-types 10000 \
        --save-segmentation emprune.model

    # Segment data using the Viterbi algorithm
    morfessor-segment \
        testdata \
        --em-prune emprune.model \
        --output segmented.testdata
Additional options for freq_substr.py ::

    --traindata-list
        Training data is a list of word types preceded by counts, not a corpus.
    --prune-redundant "-1"
        Setting prune-redundant to -1 disables pre-pruning of redundant
        substrings. Note the quotes, which prevent the dash from being
        interpreted as a flag.
    --forcesplit-before XYZ
        Force a splitting point before the characters X, Y and Z.
    --forcesplit-after XYZ
        Force a splitting point after the characters X, Y and Z.
    --forcesplit-both XYZ
        Force a splitting point both before and after the characters X, Y and Z.

Note that hyphens are no longer force-split by default;
to get the same force-splitting behavior as Morfessor Baseline,
you need to specify --forcesplit-both "-"
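To illustrate what the seed lexicon contains, here is a minimal sketch of
frequency-based substring counting, the idea behind the seed lexicon step. This
is a simplified illustration, not the actual freq_substr.py implementation; the
function name and parameters are hypothetical:

```python
from collections import Counter

def seed_lexicon(words, max_substr_len=8, lex_size=10):
    """Count all substrings (up to max_substr_len characters) of the
    corpus words, weighted by word frequency, and keep the most
    frequent ones as the seed lexicon."""
    counts = Counter()
    word_freqs = Counter(words)
    for word, freq in word_freqs.items():
        for i in range(len(word)):
            for j in range(i + 1, min(len(word), i + max_substr_len) + 1):
                counts[word[i:j]] += freq
    return counts.most_common(lex_size)

corpus = "the cat sat on the mat".split()
for substr, count in seed_lexicon(corpus, lex_size=5):
    print(substr, count)
```

The real script additionally pre-prunes redundant substrings (unless disabled
with --prune-redundant "-1") and scales to millions of substrings.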
Additional options for EM+Prune training ::

    --traindata-list
        Training data is a list of word types preceded by counts, not a corpus.
    --prune-criterion {mdl,autotune,lexicon}
        mdl: (alpha-weighted) Minimum Description Length pruning.
        autotune: MDL with automatic tuning of alpha for lexicon size.
            If you want a fixed lexicon size, use this.
            Use --num-morph-types to specify the size of the lexicon.
        lexicon: lexicon size with omitted prior or pretuned alpha.
            You probably want "autotune" instead.
    --num-morph-types N
        Goal lexicon size.
    --prune-proportion 0.2
        How large a proportion of the lexicon to prune in each epoch.
    --em-subepochs 3
        How many sub-epochs of EM to perform.
    --expected-freq-threshold 0.5
        Also prune subwords with an expected count lower than this.
    --lateen {none,full,prune}
        Lateen EM training mode.
        none: "soft" EM (default)
        full: Lateen-EM
        prune: EM+Viterbi-prune
    --no-bayesianify
        Leave out the Bayesian EM exp-digamma transformation of expected counts.
    --no-lexicon-cost
        Omit the prior entirely.
    --freq-distr-cost {baseline,omit}
        Frequency distribution prior to use.
        baseline: approximate Morfessor Baseline prior (default).
        omit: set the frequency distribution cost to zero.
    --save-pseudomodel
        Use the trained EM+Prune model to segment the training data,
        and save the resulting segmentation as if it were a Morfessor
        Baseline model.
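To make the interaction of --prune-proportion and --expected-freq-threshold
concrete, here is a much-simplified sketch of one pruning pass. It is
illustrative only: the real criterion compares coding-cost changes under the
chosen prior, not raw expected counts, and the function name is hypothetical:

```python
def prune_epoch(lexicon_counts, prune_proportion=0.2, freq_threshold=0.5):
    """One simplified pruning pass over a {subword: expected_count} dict:
    first drop subwords whose expected count falls below the threshold
    (cf. --expected-freq-threshold), then remove the lowest-ranked
    fraction of the remainder (cf. --prune-proportion). Single-character
    subwords would normally be protected; omitted here for brevity."""
    kept = {s: c for s, c in lexicon_counts.items() if c >= freq_threshold}
    n_prune = int(len(kept) * prune_proportion)
    # Rank ascending by expected count; the cheapest-to-remove go first.
    # (EM+Prune proper ranks by the change in total cost instead.)
    ranked = sorted(kept.items(), key=lambda kv: kv[1])
    for subword, _ in ranked[:n_prune]:
        del kept[subword]
    return kept

lexicon = {"un": 4.0, "believ": 3.2, "able": 5.1, "e": 0.3, "b": 0.9, "ble": 1.1}
print(sorted(prune_epoch(lexicon)))
```

Between pruning passes, the EM sub-epochs (--em-subepochs) re-estimate the
expected counts for the surviving lexicon.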
Additional options for segmentation ::

    --sample-nbest
        Sample alternative segmentations from an n-best list.
        Approximates --sample, but is much faster.
    --sample
        Sample from the full distribution. You probably want --sample-nbest
        instead.
    --sampling-temperature 0.5
        (Inverted) temperature parameter for sampling (1.0 = unsmoothed).
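To clarify the "inverted" temperature convention, here is a small sketch of
sampling from an n-best list. It is a hypothetical illustration of the idea,
not Morfessor's actual sampling code; entries are (segmentation, logprob)
pairs, and because the parameter multiplies the log-probabilities, values
below 1.0 flatten the distribution while 1.0 leaves it unsmoothed:

```python
import math
import random

def sample_nbest(nbest, temperature=0.5, rng=None):
    """Sample one segmentation from an n-best list of
    (segmentation, logprob) pairs. `temperature` is an inverted
    temperature: weights are p^temperature, so 1.0 reproduces the
    unsmoothed n-best distribution and smaller values flatten it."""
    rng = rng or random.Random()
    weights = [math.exp(lp * temperature) for _, lp in nbest]
    # Draw proportionally to the smoothed weights.
    r = rng.random() * sum(weights)
    for (seg, _), w in zip(nbest, weights):
        r -= w
        if r <= 0:
            return seg
    return nbest[-1][0]

nbest = [(("seg", "ment"), -1.0), (("se", "gment"), -2.5)]
print(sample_nbest(nbest, temperature=0.5, rng=random.Random(0)))
```

Sampling segmentations this way is commonly used as subword regularization
during downstream training.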
A note on pretokenization and boundary markers:

Morfessor EM+Prune is typically used with *word* boundary markers (which
mark where the whitespace should go), rather than the *morph* boundary
markers (which mark word-internal boundaries) used by previous Morfessor
versions. Make sure that the word boundary markers are present in the
corpus or word-count lists used for Morfessor EM+Prune training, and
also in the input to Morfessor EM+Prune during segmentation.

One way to achieve this is to use the pyonmttok library with
spacer_annotate=True and joiner_annotate=False, or the dynamicdata
dataloader with pretokenize=True. This inserts '▁' (Unicode LOWER ONE
EIGHTH BLOCK, \u2581) as the word boundary marker. Also remember to
adjust your detokenization post-processing script accordingly.
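The spacer-annotation convention can be sketched in a few lines. This is a
simplified stand-in for the pyonmttok behavior described above (the helper
names are hypothetical, and the real tokenizer also handles punctuation and
casing):

```python
def spacer_annotate(text, marker="\u2581"):
    """Prefix each whitespace-separated word with the word-boundary
    marker '▁', so the marker travels with the word through
    segmentation."""
    return " ".join(marker + token for token in text.split())

def spacer_detokenize(tokens, marker="\u2581"):
    """Invert the annotation after segmentation: concatenate the morphs,
    then turn each marker back into a preceding space."""
    return "".join(tokens).replace(marker, " ").strip()

marked = spacer_annotate("subword segmentation")
# Morfessor may split the marked words into morphs such as
# ["▁sub", "word", "▁segment", "ation"]; detokenization then
# restores the original whitespace:
restored = spacer_detokenize(["\u2581sub", "word", "\u2581segment", "ation"])
print(restored)  # subword segmentation
```

Because the marker survives segmentation intact, no separate record of the
original word boundaries is needed at detokenization time.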
Contact
-------
Questions or feedback? Email: [email protected]
Citing
------
If you use the Morfessor EM+Prune training algorithm, please cite
@inproceedings{gronroos2020morfessor,
    title = {Morfessor {EM+Prune}: Improved Subword Segmentation with Expectation Maximization and Pruning},
    author = {Gr{\"o}nroos, Stig-Arne and Virpioja, Sami and Kurimo, Mikko},
    year = {2020},
    month = {May},
    address = {Marseille, France},
    booktitle = {Proceedings of the 12th Language Resources and Evaluation Conference},
    publisher = {ELRA},
}
ArXiv preprint available online at
https://arxiv.org/abs/2003.03131
For the original Morfessor 2.0: Python implementation, please cite
@techreport{virpioja2013morfessor,
    title = {Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline},
    author = {Virpioja, Sami and Smit, Peter and Gr{\"o}nroos, Stig-Arne and Kurimo, Mikko},
    year = {2013},
    address = {Helsinki, Finland},
    type = {Report},
    language = {eng},
    number = {25/2013 in Aalto University publication series SCIENCE + TECHNOLOGY},
    institution = {Department of Signal Processing and Acoustics, Aalto University},
    pages = {38},
}
The report is available online at
http://urn.fi/URN:ISBN:978-952-60-5501-5