Portage-SMT-TAS
Traitement multilingue de textes / Multilingual Text Processing
Centre de recherche en technologies numériques / Digital Technologies Research Centre
Conseil national de recherches Canada / National Research Council Canada
Copyright 2004-2021, Sa Majesté la Reine du Chef du Canada
Copyright 2004-2021, Her Majesty in Right of Canada
MIT License - see LICENSE
See NOTICE for the Copyright notices of 3rd party libraries.
Français: LISEZMOI
Description
Portage is a project to explore new techniques in Statistical Machine
Translation (SMT), and to develop state-of-the-art SMT technology that can be
used commercially, as well as to support other projects at NRC and elsewhere.
Various SMT systems were developed under the Portage project. PORTAGEshared
was first released to Canadian universities to build and train a community of
users. Commercial releases of PORTAGEshared 1.1 to 1.3 were made, followed by
Portage 1.4.
PortageII marked a leap forward in NRC's SMT technology. With version 1.0, we
brought in significant improvements to the translation engine itself that
resulted in better translations and contributed to our success at the NIST Open
Machine Translation 2012 Evaluation (OpenMT12). With version 2.0, we added two
important features to help translators: the transfer of markup tags from the
source sentence to the machine translation output, and handling of the
increasingly common XLIFF file format. Version 2.1 was a maintenance release,
while 2.2 brought in the fixed-terms module.
PortageII 3.0 marked another important step in the core SMT technology, bringing
in deep learning (with Neural Network Joint Models, or NNJMs), an improved
reordering model based on sparse features, coarse LMs and other improvements.
The NRC put significant effort into optimizing the default training
parameters to improve the quality of the translations produced by PortageII 3.0
out of the box, and we expect users to see and appreciate the difference.
PortageII 4.0 brought in incremental document adaptation and training NNJMs on
your own data, with fully updated generic models.
Portage-SMT-TAS is the name we chose for the GitHub repo, now that we are
releasing the code base as open source under the MIT License. Despite the rise
of NMT, the Portage code base still includes a number of tools that can be
useful to the world of Machine Translation and parallel text processing. All
the names above might be found in various parts of the documentation and code,
and can be construed to refer to the same system, or one of its versions. When
you see PortageII_cur, that's a placeholder that our script for creating an
official release replaces with the actual version name and number.
Note that some of the Portage tools have been split off into other GitHub repos:
- https://github.com/nrc-cnrc/PortageTextProcessing contains text processing
tools that are relevant for both NMT and SMT work.
- https://github.com/nrc-cnrc/PortageClusterUtils contains the multi-core and
high-performance cluster parallelization tools that were developed for
Portage, but can be used for other projects as well.
Portage-SMT-TAS includes a complete suite of software tools required for
phrase-based statistical machine translation, including:
- preprocessing of French, English, Spanish, Danish (tokenization,
detokenization, lowercasing, aligning), Chinese (handling of dates and
numbers), and Arabic (integration with MADA tokenizer);
- training lexical translation models (IBM models 1 and 2, HMM, plus
integration with MGiza++ for IBM4) from aligned corpora;
- training phrase-based translation models from aligned corpora and lexical
models;
- training lexicalized distortion models, and their hierarchical variant;
- training sparse features, including discriminative hierarchical distortion
models;
- using language models in text and binary format for various purposes;
- training, fine-tuning and using Neural Network Joint Models (NNJM) to
improve translation quality;
- filtering language and translation models to retain only information needed
for a particular text to be translated;
- adapting language and translation models to reflect the characteristics of
the data currently being translated;
- decoding (doing the actual translation, producing n-best lists or lattices);
- rescoring (choosing the best hypothesis from an n-best list) using sources
of information that are external to the decoder; a collection of rescoring
features;
- optimizing weights, both for decoding and for rescoring, using a variety of
tuning algorithms, including N-Best MERT, Lattice and N-Best MIRA and a few
more;
- truecasing (restoring upper case letters in the output of the translation);
- evaluation using BLEU, WER or PER;
- a tightly packed representation of various model files (LMs, TMs, (H)LDMs,
suffix arrays) for instant loading and fast access via memory-mapped I/O;
- a confidence estimation module to compute a per-sentence quality
prediction/estimation of the translations produced by Portage;
- an API to incorporate the decoder into other applications using SOAP or CGI;
a web-ready demo;
- PortageLive -- tools to set up a runtime SMT server using Portage, accessible
via a SOAP API, via CGI scripts or a command line interface;
- incremental document model adaptation;
- code to handle translation projects in the TMX and XLIFF formats, with
optional transfer of formatting tags from the source text to the output;
- an experimental framework -- a collection of tools to automate running all
the components required to train a full machine translation system (see
https://github.com/nrc-cnrc/PortageTrainingFramework);
- generic en->fr and fr->en models to supplement in-domain training corpora,
for improved translation quality, especially for smaller corpora;
- lots of general utility code needed for the above.
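Of the evaluation metrics listed above, WER (word error rate) is the easiest to
illustrate: it is the word-level edit distance between a translation and its
reference, divided by the reference length. The sketch below is a generic
Python illustration of that definition, not Portage's own evaluation code. The
function name word_error_rate is hypothetical, and whitespace tokenization is
an assumption; Portage's tools may tokenize or normalize differently.

```python
def word_error_rate(hypothesis: str, reference: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    hyp = hypothesis.split()  # assumption: tokens are whitespace-separated
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one token")

    # prev[j] holds the edit distance between the hypothesis words
    # processed so far and the first j reference words.
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(
                prev[j] + 1,         # drop a hypothesis word
                curr[j - 1] + 1,     # add a missing reference word
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, word_error_rate("the cat sat", "the dog sat") counts one
substitution against a three-word reference. Lower values are better, and the
value can exceed 1 when the hypothesis is much longer than the reference.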
Contents of this directory
bin/            executable code (once compiled)
doc/            documentation (once generated)
framework/      framework for training PortageII and doing experiments
                [TODO - this will be a separate GitHub repo]
generic-model/  location to install PortageGenericModel 2.0
                [TODO - where will this be?]
lib/            run-time libraries (once compiled)
logs/           place-holder for runtime logs
PortageLive/    files needed to set up a runtime translation server
src/            source code
test-suite/     various test suites
third-party/    non-NRC software distributed with PortageII for convenience
tmx-prepro/     tools to extract and preprocess training corpora from TMX files
                [TODO - this will be a separate GitHub repo]
INSTALL         installation instructions
NOTICE          third party Copyright notices
README          this file
RELEASES        release history with change notes
SETUP.bash      config file to activate Portage-SMT-TAS
Getting started
The framework subdirectory [TODO - fix] provides a PortageII training and experiment
framework. We recommend you use this framework as your starting point for real
experiments, and as your default configuration for training PortageII systems
for use with PortageLive. A tutorial is included showing how to run a toy
experiment with this framework: tutorial.pdf. Even if you have used PortageII
2.2 or earlier, we recommend you read through this document, as it
highlights the currently recommended procedures for using PortageII. Please do
not use our old toy example as a starting point: start from the framework
instead.
For further documentation, look at the contents of doc/.
Upgrading from a previous version
If you have models that were created using a version of PortageII older than
3.0, we strongly recommend that you retrain new models with the current
version, to take advantage of the new features.
Between PortageII 2.x and PortageII 3.0, we have conducted extensive
experiments to optimize the training procedures, and updated the framework
defaults to reflect the configuration we now recommend. If you have customized
or created templates from the framework's Makefile.params file, please update
them to reflect the changes in PortageII 3.0. If you use custom plugins,
please review the new default plugins, which might meet your needs out of the
box.
We now recommend (and enable by default) the following settings:
- Use the new sparse features.
- Use the new Coarse LM models.
- Use IBM4 word-alignments, along with IBM2 and HMM ones (IBM4 requires MGiza++).
- If the target language is French or English, use the generic LM as a
background model in a MixLM with your main language model (note: this will
not introduce out-of-domain terminology in your translations; rather, it
will help choose among competing alternatives from your own in-domain data).
For the language model, the framework now has a PRIMARY_LM variable that you
can use instead of TRAIN_LM or MIXLM: when you use it, the framework will
automatically use the generic LM if you're translating into French or English,
and it will automatically use MixLMs instead of regular ones as soon as there
is more than one component LM or if the generic LM applies.
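The selection rule described above can be summarized in a short Python sketch.
It only illustrates the decision logic as stated in this README; it is not the
framework's actual code, which lives in its makefiles. The names LMPlan,
plan_language_model and "generic-lm", and the "en"/"fr" language codes, are
hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import List

# Assumption: the generic LM is available for French and English targets only.
GENERIC_LM_TARGETS = {"en", "fr"}


@dataclass
class LMPlan:
    components: List[str]  # component LMs that will be used
    use_mixlm: bool        # True if they are combined in a MixLM


def plan_language_model(primary_lms: List[str], target_lang: str) -> LMPlan:
    """Mimic the PRIMARY_LM behaviour described in the README."""
    generic_applies = target_lang in GENERIC_LM_TARGETS
    components = list(primary_lms)
    if generic_applies:
        # "generic-lm" is a placeholder, not a real file name.
        components.append("generic-lm")
    # A MixLM is used as soon as there is more than one component LM
    # or the generic LM applies.
    use_mixlm = len(primary_lms) > 1 or generic_applies
    return LMPlan(components, use_mixlm)
```

With a single in-domain LM and French as the target, this plan adds the
generic LM and switches to a MixLM. With a single LM and a target other than
French or English, it keeps one regular LM.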
A matrix showing the evolution of the SOAP API, plugins and training features
is now provided: review doc/PortageAPIComparison.pdf when upgrading.
The PortageLive SOAP API has changed significantly with version 3.0 as well.
Review the updated PortageLiveAPI.wsdl, doc/PortageAPIComparison.pdf and the
PHP scripts in PortageLive/www/soap when updating your applications to work
with PortageII 3.0. When installing the updated API, remove compiled WSDL
files cached by Apache in /tmp/wsdl-*.
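The cache cleanup step can be scripted. The following is a minimal Python
sketch, assuming the cached files match /tmp/wsdl-* as stated above. The
function name remove_cached_wsdl is hypothetical and not part of PortageLive.

```python
import glob
import os
from typing import List


def remove_cached_wsdl(pattern: str = "/tmp/wsdl-*") -> List[str]:
    """Delete regular files matching the pattern; return the removed paths."""
    removed = []
    for path in glob.glob(pattern):
        # Skip directories and anything else that is not a plain file.
        if os.path.isfile(path):
            os.remove(path)
            removed.append(path)
    return sorted(removed)
```

Calling remove_cached_wsdl() with no argument targets the default location.
It needs to run as a user with permission to delete the files written by the
web server.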