Restore src_features for v3.0 #2308
Conversation
TODOs:
…s and model sizes
@guillaumekln @francoishernandez Following an offline discussion with @anderleich: since we are now coming back to the old Lua way of handling features (much easier), we are going to get rid of the "inferfeats" transform, considering that the pyonmttok Tokenizer does handle tokenization with multiple features attached to the source / target files. EDIT: maybe not, since some may still want to use legacy SP or BPE.
OK, I tested it and it works fine.
Great news! I'm trying to modify the server to allow source features again.
The way inputs are passed to the server needs to change. I'm willing to make the necessary changes, but it will imply several changes which might not fit this PR. Maybe we can just merge this PR without the server and open a new one to make the server functional again.
Nice work!
A few comments below.
@guillaumekln while implementing the source features, @anderleich changed the feature vocabs into a list of vocabs instead of a Dict.
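For illustration only, a minimal sketch of what that container change amounts to (all names and values here are made up, not the actual codebase identifiers):

```python
# Hypothetical illustration of the vocab container change (names are made up).
vocab_pos = {"NOUN": 0, "VERB": 1, "<unk>": 2}    # stand-in for one feature vocabulary
vocab_case = {"lower": 0, "upper": 1, "<unk>": 2}  # stand-in for another

# Before: feature vocabularies keyed by feature name (a Dict).
feats_vocabs_dict = {"pos": vocab_pos, "case": vocab_case}

# After: an ordered list, indexed by the feature's position after each "│" separator.
feats_vocabs_list = [vocab_pos, vocab_case]
```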
You should do what's easier and clearer for this codebase. We can always update the CT2 converter later.
This PR intends to restore source features support for OpenNMT-py v3.0. All the code has been adapted for this new version.
Source features support has been refactored to simplify the handling of features. The way features are passed to the system has changed: features are now appended to the actual textual data instead of being provided in a separate file. This also simplifies the way features are passed during inference and to the server. The special character │ is used as the feature separator, as in previous versions of the OpenNMT framework. For instance:
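For illustration, with made-up tokens and one made-up case-like feature per token, an annotated source line could look like:

```
however│C ,│N we│L must│L not│L forget│L this│N .│N
```

The corresponding target line stays plain tokenized text, since only source features are handled here.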
I've also added a way to provide default values for features. This can be really useful when mixing task-specific data (with features) with general data which has not been annotated. Additionally, the filterfeats transform is no longer required; features are now checked while loading the corpus. A YAML configuration file would look like this:
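A minimal sketch of such a configuration, under the assumption that the new options are named n_src_feats and src_feats_defaults (option names and paths here are illustrative and may differ from the final PR):

```yaml
# Illustrative sketch only; option names and paths are assumptions.
data:
    annotated_corpus:
        path_src: data/train.src.feats   # source tokens annotated as "token│feature"
        path_tgt: data/train.tgt
    general_corpus:
        path_src: data/general.src       # no annotations; default feature values apply
        path_tgt: data/general.tgt

# Assumed options for source features:
n_src_feats: 1              # number of features attached to each source token
src_feats_defaults: "0"     # default value(s) used for corpora without annotations
```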
Note! This PR is the first step as discussed in #2289