Restore src_features for v3.0 #2308

anderleich · 2023-02-06T12:56:05Z

This PR intends to restore source features support for OpenNMT-py v3.0. All the code has been adapted for this new version.

Source features support has been refactored to simplify feature handling. Features are now appended to the textual data itself instead of being provided in a separate file, which also simplifies passing features during inference and to the server. The special character ￨ serves as the feature separator, as in previous versions of the OpenNMT framework. For instance:

 I￨1￨3 love￨0￨1 eating￨0￨1 pizza￨0￨1

I've also added a way to provide default values for features. This is useful when mixing task-specific data (with features) with general data that has not been annotated. Additionally, the filterfeats transform is no longer required; features are now checked during corpus loading.
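To make the format concrete, here is a minimal sketch of how an annotated line could be split into words and feature columns, with default values filled in for unannotated tokens. The `split_features` helper and its arguments are hypothetical and only illustrate the idea; this is not the parser shipped in this PR.

```python
FEAT_SEP = "￨"  # U+FFE8, the feature separator shown in the example above


def split_features(line, n_feats, defaults=None):
    """Split an annotated line into words and per-feature columns.

    Hypothetical helper for illustration only. `defaults` is a string
    such as "0￨1", applied to tokens that carry no annotations.
    """
    default_values = defaults.split(FEAT_SEP) if defaults else None
    if default_values is not None and len(default_values) != n_feats:
        raise ValueError("defaults must provide one value per feature")

    words = []
    feats = [[] for _ in range(n_feats)]
    for token in line.split():
        word, *values = token.split(FEAT_SEP)
        if not values and default_values is not None:
            values = default_values  # unannotated token: fall back to defaults
        if len(values) != n_feats:
            raise ValueError(f"expected {n_feats} features in token {token!r}")
        words.append(word)
        for column, value in zip(feats, values):
            column.append(value)
    return words, feats


words, feats = split_features("I￨1￨3 love￨0￨1 eating￨0￨1 pizza￨0￨1", n_feats=2)
print(words)  # ['I', 'love', 'eating', 'pizza']
print(feats)  # [['1', '0', '0', '0'], ['3', '1', '1', '1']]
```

With `defaults="0￨1"`, a plain line such as `I love pizza` would get feature values `0` and `1` on every token, which is what makes mixing annotated and general data possible.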

A YAML configuration file would look like this:

data:
    train:
        path_src: src_with_features.txt  #  I￨1￨3 love￨0￨1 eating￨0￨1 pizza￨0￨1
        path_tgt: tgt.txt # Me gusta comer pizza
        transforms: [onmt_tokenize, inferfeats, filtertoolong]
    valid:
        path_src: src_with_features.txt
        path_tgt: tgt_with_features.txt
        transforms: [onmt_tokenize, inferfeats]

save_data: ./data
n_sample: -1

# # Vocab opts
src_vocab: data.vocab.src  # Automatically generates data.vocab.src.feat_0 and data.vocab.src.feat_1
tgt_vocab: data.vocab.tgt
n_src_feats: 2
src_feats_defaults: "0￨1"
feat_merge: "sum"
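As a rough illustration of what `feat_merge: "sum"` implies, the sketch below adds a word vector and one vector per feature element by element, which only works if all embeddings share the same size. It uses plain Python lists instead of real embedding layers, and `make_table` and `embed_sum` are hypothetical names, so treat it as a toy model of the idea rather than the actual implementation.

```python
import random


def make_table(vocab_size, dim, seed):
    """Build a toy embedding table: one random vector per vocabulary entry."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]


def embed_sum(word_id, feat_ids, word_table, feat_tables):
    """Sum the word vector and one vector per feature, element by element."""
    if len(feat_ids) != len(feat_tables):
        raise ValueError("one feature id is required per feature table")
    vectors = [word_table[word_id]]
    vectors += [table[i] for table, i in zip(feat_tables, feat_ids)]
    if len({len(v) for v in vectors}) != 1:
        raise ValueError("sum merge needs all embeddings to share one size")
    return [sum(components) for components in zip(*vectors)]


DIM = 4
word_table = make_table(vocab_size=10, dim=DIM, seed=0)
# Two feature tables, matching n_src_feats: 2 in the config above.
feat_tables = [
    make_table(vocab_size=2, dim=DIM, seed=1),
    make_table(vocab_size=4, dim=DIM, seed=2),
]
merged = embed_sum(word_id=3, feat_ids=[1, 3],
                   word_table=word_table, feat_tables=feat_tables)
print(len(merged))  # 4: the merged vector keeps the word embedding size
```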

Note! This PR is the first step as discussed in #2289

anderleich · 2023-02-06T16:43:27Z

TODOs:

Update server
Update FAQs

vince62s · 2023-02-07T09:24:28Z

@guillaumekln @francoishernandez following an offline discussion with @anderleich: since we are going back to the old Lua way of handling features (much easier), we are going to get rid of the "inferfeats" transform, given that the pyonmttok Tokenizer does handle tokenization with multiple features attached to the source / target files.
Are you guys ok with this approach?

EDIT: maybe not, since some may still want to use SP or BPE legacy.

vince62s · 2023-02-08T07:35:24Z

ok I tested it and it works fine.
@francoishernandez maybe you can have another quick review before merge
@anderleich there is still the server and we're good I think.

anderleich · 2023-02-08T11:31:45Z

Great news! I'm trying to modify the server to allow source features again.
However, I've found some strange errors during testing, which made me wonder whether the server was working correctly in the first place.
In the master branch (without changes) when I try to run the server I get this error:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)

The way inputs are passed to the _translate() method is different in the case of the server. IterOnDevice is not being used here, so inputs are always left on cpu.
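For illustration, one generic way to avoid this kind of device mismatch is to move every tensor in a batch to the model's device before translating. The sketch below is a hypothetical helper that assumes the batch is a dict, list, or tuple of objects exposing a `.to(device)` method (as PyTorch tensors do); it is not what IterOnDevice does internally, nor the fix applied to the server.

```python
def move_to_device(obj, device):
    """Recursively move anything exposing a `.to(device)` method.

    Handles tensors nested in dicts, lists, and tuples; other values are
    returned unchanged. Hypothetical helper, not part of OpenNMT-py.
    """
    if callable(getattr(obj, "to", None)):
        return obj.to(device)
    if isinstance(obj, dict):
        return {key: move_to_device(value, device) for key, value in obj.items()}
    if isinstance(obj, list):
        return [move_to_device(item, device) for item in obj]
    if isinstance(obj, tuple):
        return tuple(move_to_device(item, device) for item in obj)
    return obj


# Example with PyTorch (assumes `batch` is a dict of CPU tensors):
#   batch = move_to_device(batch, torch.device("cuda:0"))
```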

I'm willing to make the necessary changes, but that will involve several changes which might not fit in this PR. Maybe we can just merge this PR without the server and open a new one to make the server functional again.

francoishernandez

Nice work!
Few comments below.

docs/source/FAQ.md

onmt/inputters/text_utils.py

onmt/translate/translation_server.py

onmt/utils/parse.py

vince62s · 2023-02-10T10:35:53Z

@guillaumekln while implementing the source features, @anderleich changed the feature vocabs from a Dict to a list of vocabs.
I know that pyonmt Vocabs are now picklable, but what do you think?

Either we stick to Dicts, to keep things the same as for the word vocab, and then the CT2 converter does not require any update,
or we use Lists (maybe easier for the code base), but then CT2 needs to be adapted.
What do you think?
cc: @francoishernandez

guillaumekln · 2023-02-10T10:45:36Z

You should do what's easier and clearer for this codebase. We can always update the CT2 converter later.

Restored src_features for v3.0

e27ea63

anderleich mentioned this pull request Feb 6, 2023

[WIP] Support source and target features #2289

Closed

Fixed dynamic scoring with source features + small fixes

3e6f4b9

Add check to ensure consistencies between word/feature embedding sizes and model sizes

ba58b60

anderleich added 3 commits February 7, 2023 11:12

Fixed inferfeats transform

1f8b702

Updated the docs

43fd026

Updated the docs

2d8715f

Modified server to allow source features

59f484f

francoishernandez reviewed Feb 9, 2023

Added unittest for parse_features() and updated FAQ

408728f

Fixed translation_server to allow source features with CT2

a943bd5

vince62s merged commit 62c96cc into OpenNMT:master Feb 10, 2023

anderleich deleted the restore_source_features branch February 13, 2023 09:06
