Add an option to not encode sentencepiece during training/decoding al… #1003

Merged
merged 4 commits into from
Aug 17, 2023

Conversation

Contributor

@XapaJIaMnu XapaJIaMnu commented Jul 28, 2023

Description

This PR adds the ability to train or decode with input that has already had spm_encode --model model.spm applied to it.

The benefit of this is that we can apply spm modifications prior to feeding the data to marian, giving us more flexibility than what SPM allows.

The code is minimally intrusive and doesn't change the behavior unless the flag is toggled on.

How to test

$ echo "Die Liste der Partien der Schachweltmeisterschaft 1986 führt sämtliche Partien auf, die beim Wettkampf um den Weltmeistertitel im Schach zwischen dem seit 1985 amtierenden Weltmeister Garri Kasparow und dem Herausforderer Anatoli Karpow (beide Sowjetunion) gespielt wurden." | ~/marian-dev/build/spm_encode --model vocab.deen.spm |  ~/marian-dev/build/marian-decoder -c model.npz.best-bleu.npz.decoder.yml --mini-batch 1 --maxi-batch 1 --quiet --quiet-translation --no-spm-encode
The list of World Chess Championship games 1986 lists all the games played in the competition for the World Chess Championship title between the world champion Garri Kasparov and the challenger Anatoly Karpow (both of the Soviet Union) since 1985.

$ echo "Die Liste der Partien der Schachweltmeisterschaft 1986 führt sämtliche Partien auf, die beim Wettkampf um den Weltmeistertitel im Schach zwischen dem seit 1985 amtierenden Weltmeister Garri Kasparow und dem Herausforderer Anatoli Karpow (beide Sowjetunion) gespielt wurden." |  ~/marian-dev/build/marian-decoder -c model.npz.best-bleu.npz.decoder.yml --mini-batch 1 --maxi-batch 1 --quiet --quiet-translation
The list of World Chess Championship games 1986 lists all the games played in the competition for the World Chess Championship title between the world champion Garri Kasparov and the challenger Anatoly Karpow (both of the Soviet Union) since 1985.

Checklist

  • I have tested the code manually
  • I have run regression tests
  • I have read and followed CONTRIBUTING.md
  • I have updated CHANGELOG.md

@graemenail
Member

If we prefer to produce output in SPM pieces then we could use this for mapping

@jelmervdl
Contributor

I was initially of the opinion that token ids would be safer, but looking at how byte fallback pieces look, I'd say pieces are fine. Maybe even better because they're somewhat human readable and you can see what's going on.

>>> spm.encode('🤣', out_type=str)
['▁', '<0xF0>', '<0x9F>', '<0xA4>', '<0xA3>']
>>> spm.encode('🤣', out_type=int)
[275, 247, 166, 171, 170]
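
To make the readability point concrete, here is a minimal sketch of how byte-fallback pieces map back to text. It is plain Python and does not call the sentencepiece library; the helper name `pieces_to_text` is hypothetical, and the handling of the leading "▁" is a simplifying assumption rather than SentencePiece's exact detokenization.

```python
import re

# Matches byte-fallback pieces such as "<0xF0>".
BYTE_PIECE = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")
SPACE_MARKER = "\u2581"  # the "▁" whitespace marker used in pieces


def pieces_to_text(pieces):
    """Approximate detokenization of a list of pieces (illustrative only)."""
    out = bytearray()
    for piece in pieces:
        match = BYTE_PIECE.match(piece)
        if match:
            # A byte-fallback piece carries exactly one raw UTF-8 byte.
            out.append(int(match.group(1), 16))
        else:
            out.extend(piece.replace(SPACE_MARKER, " ").encode("utf-8"))
    text = out.decode("utf-8", errors="replace")
    # Assumption: drop the single leading space produced by the first "▁".
    return text[1:] if text.startswith(" ") else text


print(pieces_to_text(["▁", "<0xF0>", "<0x9F>", "<0xA4>", "<0xA3>"]))  # 🤣
```

The four byte pieces are just the UTF-8 encoding of the emoji, which is why the piece form stays inspectable by eye while the id form does not.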

@XapaJIaMnu
Contributor Author

Updated to use spm pieces as opposed to spm vocab ids so that the input can also be somewhat human readable.

@ZJaume

ZJaume commented Jul 28, 2023

Careful: in SentencePiece models without byte fallback, tokenizing into pieces leaves unknown characters as they are instead of emitting the unk token:

>>> spm.encode('ç', out_type=int)
[25, 0]
>>> spm.encode('ç', out_type=str)
['▁', 'ç']

@emjotde
Member

emjotde commented Jul 28, 2023

There is basically already a way to do that. If you use spm_export --model bla.spm | cut -f 1 -d ' ' > bla.txt you can just do marian -v bla.txt bla.txt -t src.spm_tokenized.txt tgt.spm_tokenized.txt
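
For readers who want the same vocabulary extraction without cut, a rough Python equivalent is sketched below. It assumes each exported line starts with the piece followed by a whitespace-separated score; the function name `pieces_from_export` and the sample lines are made up for illustration and are not output from a real model.

```python
def pieces_from_export(lines):
    """Keep only the first whitespace-separated field (the piece) of each line."""
    vocab = []
    for line in lines:
        fields = line.split()
        if fields:  # skip empty lines
            vocab.append(fields[0])
    return vocab


# Hypothetical export lines: piece, then score.
sample = ["<unk>\t0", "<s>\t0", "</s>\t0", "▁the\t-3.2"]
print(pieces_from_export(sample))  # ['<unk>', '<s>', '</s>', '▁the']
```

Note that "▁" (U+2581) is not treated as whitespace by `str.split()`, so pieces that begin with it survive intact.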

@emjotde emjotde self-assigned this Jul 28, 2023
Member

@emjotde emjotde left a comment

Blocking for now until comment about using existing capabilities is resolved.

@XapaJIaMnu
Contributor Author

Hi, so the goal of this change is to allow for training to be done on SPM-d corpora, while translation/validation still happens on de-SPM-d corpora so you can get accurate BLEU scores.

It also brings parity with the --no-spm-decode option. Technically both could be achieved by transforming the spm vocabulary into a plain vocabulary, but we have an option for --no-spm-decode and no corresponding option for --no-spm-encode.

@XapaJIaMnu
Contributor Author

XapaJIaMnu commented Jul 31, 2023

@ZJaume, I just tested with sentencepiece and a vocab that doesn't have byte fallback, and it does indeed pass unknown characters through unchanged when encoding into pieces:

$ cat test.bg | ~/marian-dev/build/spm_encode --model vocab.esen.spm 
▁en g lish ▁text ▁ бг ▁ текст ▁ 靐
$ cat test.bg 
english text бг текст  靐

$ cat test.bg | ~/marian-dev/build/spm_encode --model vocab.esen.spm | ~/marian-dev/build/marian-decoder -m model.npz -v vocab.esen.spm vocab.esen.spm  --mini-batch 1 --maxi-batch 1 --cpu-threads 1 --no-spm-encode --quiet --quiet-translation
texto english
$ cat test.bg | ~/marian-dev/build/marian-decoder -m model.npz -v vocab.esen.spm vocab.esen.spm  --mini-batch 1 --maxi-batch 1 --cpu-threads 1 --quiet --quiet-translation
texto english

I also looked at the source: spm_->PieceToId(), which we use to produce the vocab ID, can generate unks, as seen here: https://github.com/marian-nmt/marian-dev/blob/master/src/data/sentencepiece_vocab.cpp#L50
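
As a rough illustration of why passed-through pieces are harmless, the lookup behaviour can be sketched in plain Python. This is not Marian's code: the toy vocabulary, the function names, and the whitespace splitting of pre-encoded input are assumptions, and the ids 25 and 0 are borrowed from the example earlier in this thread.

```python
UNK_ID = 0  # assumption: <unk> has id 0, as in the earlier example

# Hypothetical toy vocabulary; real ids come from the .spm model.
TOY_VOCAB = {"<unk>": UNK_ID, "▁": 25, "▁text": 310}


def piece_to_id(piece, vocab=TOY_VOCAB, unk_id=UNK_ID):
    """Return the id of a piece, falling back to the unk id if it is unknown."""
    return vocab.get(piece, unk_id)


def encode_pretokenized(line, vocab=TOY_VOCAB):
    """Map an already SPM-encoded line (space-separated pieces) to ids."""
    return [piece_to_id(piece, vocab) for piece in line.split()]


print(encode_pretokenized("▁ ç"))  # [25, 0]
```

A piece that is not in the vocabulary, like 'ç' here, ends up as the unk id at lookup time, which matches the ids produced when encoding the raw text directly.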

In light of this, I think this is ready to merge.

@snukky
Member

snukky commented Aug 16, 2023

@XapaJIaMnu Will you resolve the conflicts (seem simple) and update the patch number in the VERSION file, or would you prefer me to do that? I can then merge.

@XapaJIaMnu
Contributor Author

I think I fixed it, @snukky.

@snukky snukky merged commit 961a728 into master Aug 17, 2023
9 checks passed