Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restore src_features for v3.0 #2308

Merged
merged 9 commits into from
Feb 10, 2023

Conversation

anderleich
Copy link
Contributor

This PR intends to restore source features support for OpenNMT-py v3.0. All the code has been adapted for this new version.

Source features support has been refactored for a more simplified handling of features. The way features are passed to the system has been changed and now features are appended to the actual textual data instead of providing a separate file. This also simplifies the way features are passed during inference and to the server. It uses the special character as a feature separator, as in the previous versions of the OpenNMT framework. For instance:

 I│1│3 love│0│1 eating│0│1 pizza│0│1

I've also added a way to provide default values for features. This can be really useful when mixing task specific data (with features) with general data which has not been annotated. Additionally, the filterfeats transform is no longer required and features are checked in the corpus loading process.

A YAML configuration file would look like this:

data:
    train:
        path_src: src_with_features.txt  #  I│1│3 love│0│1 eating│0│1 pizza│0│1
        path_tgt: tgt.txt # Me gusta comer pizza
        transforms: [onmt_tokenize, inferfeats, filtertoolong]
    valid:
        path_src: src_with_features.txt
        path_tgt: tgt_with_features.txt
        transforms: [onmt_tokenize, inferfeats]

save_data: ./data
n_sample: -1

# # Vocab opts
src_vocab: data.vocab.src  # Automatically generates data.vocab.src.feat_0 and data.vocab.src.feat_1
tgt_vocab: data.vocab.tgt
n_src_feats: 2
src_feats_defaults: "0│1"
feat_merge: "sum"

Note! This PR is the first step as discussed in #2289

@anderleich
Copy link
Contributor Author

anderleich commented Feb 6, 2023

TODOs:

  • Update server
  • Update FAQs

@vince62s
Copy link
Member

vince62s commented Feb 7, 2023

@guillaumekln @francoishernandez following an offline discussion with @anderleich, since we are now coming back to the old LUA way to handle features (much easier) we are going to get rid of the "inferfeats" transform taking into consideration that the pyonmttok Tokenizer DO handle the tokenization with multiple features attached to the source / target files.
Are you guys ok with this approach ?

EDIT: maybe not, since some may still want to use SP or BPE legacy.

@vince62s
Copy link
Member

vince62s commented Feb 8, 2023

ok I tested it and it works fine.
@francoishernandez maybe you can have another quick review before merge
@anderleich there is still the server and we're good I think.

@anderleich
Copy link
Contributor Author

anderleich commented Feb 8, 2023

Great news! I'm trying to modify the server to allow source features again.
However, I've found some strange errors during the testing which made me wonder if the server was working correctly at the first place.
In the master branch (without changes) when I try to run the server I get this error:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)

The way inputs are passed to the _translate() method is different in the case of the server. IterOnDevice is not being used here, so inputs are always left on cpu.

I'm willing to make the necesary changes but it will imply several changes wich might not fit this PR. Maybe we can just merge this PR without the server and open a new one to make the server functional again.

Copy link
Member

@francoishernandez francoishernandez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!
Few comments below.

docs/source/FAQ.md Outdated Show resolved Hide resolved
docs/source/FAQ.md Show resolved Hide resolved
onmt/inputters/text_utils.py Show resolved Hide resolved
onmt/translate/translation_server.py Outdated Show resolved Hide resolved
onmt/translate/translation_server.py Outdated Show resolved Hide resolved
onmt/utils/parse.py Show resolved Hide resolved
@vince62s
Copy link
Member

@guillaumekln while implementing the source features @anderleich changed the features vocabs into list of vocabs vs Dict.
I know that now pyonmt Vocabs are pickable but what do you think ?

  1. we should stick to Dicts to make things the same wrt the word vocab and then CT2 converter does not require any update
  2. we can use Lists (maybe easier for code base) but then CT2 needs to be adapted
    What do you think ?
    cc: @francoishernandez

@guillaumekln
Copy link
Contributor

You should do what's easier and clearer for this codebase. We can always update the CT2 converter later.

@vince62s vince62s merged commit 62c96cc into OpenNMT:master Feb 10, 2023
@anderleich anderleich deleted the restore_source_features branch February 13, 2023 09:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants