Add an option to not encode sentencepiece during training/decoding al… #1003

XapaJIaMnu · 2023-07-28T13:23:44Z

Description

This PR adds the ability to train or decode with a sentence that has already had spm_encode --model model.spm applied to it.

The benefit of this is that we can apply spm modifications prior to feeding the data to marian, giving us more flexibility than what SPM allows.

The code is minimally intrusive and doesn't change the behavior unless the flag is toggled on.

How to test

$ echo "Die Liste der Partien der Schachweltmeisterschaft 1986 führt sämtliche Partien auf, die beim Wettkampf um den Weltmeistertitel im Schach zwischen dem seit 1985 amtierenden Weltmeister Garri Kasparow und dem Herausforderer Anatoli Karpow (beide Sowjetunion) gespielt wurden." | ~/marian-dev/build/spm_encode --model vocab.deen.spm |  ~/marian-dev/build/marian-decoder -c model.npz.best-bleu.npz.decoder.yml --mini-batch 1 --maxi-batch 1 --quiet --quiet-translation --no-spm-encode
The list of World Chess Championship games 1986 lists all the games played in the competition for the World Chess Championship title between the world champion Garri Kasparov and the challenger Anatoly Karpow (both of the Soviet Union) since 1985.

$ echo "Die Liste der Partien der Schachweltmeisterschaft 1986 führt sämtliche Partien auf, die beim Wettkampf um den Weltmeistertitel im Schach zwischen dem seit 1985 amtierenden Weltmeister Garri Kasparow und dem Herausforderer Anatoli Karpow (beide Sowjetunion) gespielt wurden." |  ~/marian-dev/build/marian-decoder -c model.npz.best-bleu.npz.decoder.yml --mini-batch 1 --maxi-batch 1 --quiet --quiet-translation
The list of World Chess Championship games 1986 lists all the games played in the competition for the World Chess Championship title between the world champion Garri Kasparov and the challenger Anatoly Karpow (both of the Soviet Union) since 1985.

Checklist

I have tested the code manually
I have run regression tests
I have read and followed CONTRIBUTING.md
I have updated CHANGELOG.md

…lowing passing of spmIDs directly

graemenail · 2023-07-28T14:18:50Z

If we prefer to produce output in SPM pieces then we could use this for mapping

jelmervdl · 2023-07-28T16:31:56Z

I was initially of the opinion that token ids would be safer, but looking at how byte fallback pieces look, I'd say pieces are fine. Maybe even better because they're somewhat human readable and you can see what's going on.

>>> spm.encode('🤣', out_type=str)
['▁', '<0xF0>', '<0x9F>', '<0xA4>', '<0xA3>']
>>> spm.encode('🤣', out_type=int)
[275, 247, 166, 171, 170]
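
To illustrate why byte fallback pieces stay recoverable, here is a minimal sketch that turns a piece sequence like the one above back into text. It is not SentencePiece's own decoder: the name pieces_to_text is hypothetical, and the sketch assumes that byte pieces always have the form <0xNN> and that a single leading '▁' is the dummy prefix.

```python
import re

# Assumption: byte fallback pieces always look like <0xNN> (two hex digits).
BYTE_PIECE = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")


def pieces_to_text(pieces):
    """Rebuild text from pieces (hypothetical helper, not the SPM decoder)."""
    buf = bytearray()
    for piece in pieces:
        match = BYTE_PIECE.match(piece)
        if match:
            # A byte fallback piece contributes exactly one raw byte.
            buf.append(int(match.group(1), 16))
        else:
            # A regular piece contributes its own UTF-8 bytes.
            buf.extend(piece.encode("utf-8"))
    text = buf.decode("utf-8", errors="replace")
    # U+2581 marks a word boundary; map it back to a plain space.
    text = text.replace("\u2581", " ")
    # Drop the single leading space that comes from the dummy prefix.
    return text[1:] if text.startswith(" ") else text


print(pieces_to_text(['▁', '<0xF0>', '<0x9F>', '<0xA4>', '<0xA3>']))
```

The four byte pieces are the UTF-8 encoding of the emoji, so the pieces carry the same information as the ids while remaining readable.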

XapaJIaMnu · 2023-07-28T17:07:34Z

Updated to use spm pieces as opposed to spm vocab ids so that the input can also be somewhat human readable.

ZJaume · 2023-07-28T17:50:53Z

Careful: in SP models without byte fallback, unknown characters are left as they are when tokenizing into pieces, instead of being replaced with the unk token:

>>> spm.encode('ç', out_type=int)
[25, 0]
>>> spm.encode('ç', out_type=str)
['▁', 'ç']

emjotde · 2023-07-28T19:13:43Z

There is basically already a way to do that. If you export the vocabulary with

$ spm_export --model bla.spm | cut -f 1 -d ' ' > bla.txt

you can just run

$ marian -v bla.txt bla.txt -t src.spm_tokenized.txt tgt.spm_tokenized.txt

emjotde

Blocking for now until comment about using existing capabilities is resolved.

XapaJIaMnu · 2023-07-31T14:32:47Z

Hi, so the goal of this change is to allow for training to be done on SPM-d corpora, while translation/validation still happens on de-SPM-d corpora so you can get accurate BLEU scores.

It also brings parity with the --no-spm-decode option. Technically, both could be achieved by transforming the SPM vocabulary into a plain vocabulary, but we have an option for --no-spm-decode and none for --no-spm-encode.

XapaJIaMnu · 2023-07-31T22:21:39Z

@ZJaume, I just tested with a SentencePiece vocab that doesn't have byte fallback, and it seems that spm_encode does indeed pass unknown characters through as they are:

$ cat test.bg | ~/marian-dev/build/spm_encode --model vocab.esen.spm 
▁en g lish ▁text ▁ бг ▁ текст ▁ 靐
$ cat test.bg 
english text бг текст  靐

$ cat test.bg | ~/marian-dev/build/spm_encode --model vocab.esen.spm | ~/marian-dev/build/marian-decoder -m model.npz -v vocab.esen.spm vocab.esen.spm  --mini-batch 1 --maxi-batch 1 --cpu-threads 1 --no-spm-encode --quiet --quiet-translation
texto english
$ cat test.bg | ~/marian-dev/build/marian-decoder -m model.npz -v vocab.esen.spm vocab.esen.spm  --mini-batch 1 --maxi-batch 1 --cpu-threads 1 --quiet --quiet-translation
texto english

I also looked at the source: spm_->PieceToId(), which we use to produce the vocab ID, can generate unks, as seen here: https://github.com/marian-nmt/marian-dev/blob/master/src/data/sentencepiece_vocab.cpp#L50
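
The behaviour described here can be modelled with a toy lookup. This is only an illustration, not Marian's or SentencePiece's code: TOY_VOCAB, UNK_ID and pieces_to_ids are made-up names, and the ids are borrowed from the 'ç' example earlier in this thread.

```python
# Assumption: id 0 is the unk id, as in the 'ç' example earlier in the thread.
UNK_ID = 0

# Toy vocabulary containing only the word-boundary piece.
TOY_VOCAB = {"▁": 25}


def pieces_to_ids(pieces, vocab=TOY_VOCAB, unk_id=UNK_ID):
    """Map each piece to its id; pieces missing from the vocab become unk."""
    return [vocab.get(piece, unk_id) for piece in pieces]


# 'ç' passes through as a piece, but is not in the vocabulary.
print(pieces_to_ids(["▁", "ç"]))  # [25, 0]
```

Under that assumption, a character that passes through piece-level encoding unchanged still ends up as unk once pieces are mapped to ids, which would explain why both pipelines above give the same translation.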

In light of this, I think this is ready to merge.

snukky · 2023-08-16T12:59:37Z

@XapaJIaMnu Will you resolve the conflicts (seem simple) and update the patch number in the VERSION file, or would you prefer me to do that? I can then merge.

XapaJIaMnu · 2023-08-17T09:00:21Z

I think I fixed it, @snukky.

XapaJIaMnu added 2 commits July 28, 2023 13:55

Add an option to not encode sentencepiece during training/decoding al…

e878823

…lowing passing of spmIDs directly

Update changelog

29d5d60

XapaJIaMnu requested review from emjotde and snukky July 28, 2023 13:23

numbers -> pieces

5d7d080

emjotde self-assigned this Jul 28, 2023

emjotde requested changes Jul 28, 2023

snukky approved these changes Jul 31, 2023

jelmervdl mentioned this pull request Aug 9, 2023

Alignment passthrough hplt-project/OpusTrainer#26

Merged

merge with master

e80b6bb

snukky merged commit 961a728 into master Aug 17, 2023
9 checks passed
