BPE codes #1

Open
villmow opened this issue Apr 7, 2020 · 1 comment

villmow commented Apr 7, 2020

Hi,

Nice project, thanks! I'm trying to replicate your setup for IWSLT'14. Did you change the variable BPE_TOKENS in fairseq's prepare-iwslt14.sh to 32k, as mentioned in your paper?

Would you be willing to share your BPE codes with me?

Thanks

villmow commented Apr 10, 2020

Hi, I'm struggling to create the data the way you described. I followed the instructions closely, and the data after preprocessing with fairseq looks like this:

Some line of test.en:

to these non-@@ engineers , li@@ tt@@ leb@@ its became another material , electr@@ on@@ ics became just another material .
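For reference, a minimal sketch of how the `@@ ` continuation markers in this line map back to whole words. The regex is the usual subword-nmt style removal pattern; the helper name `remove_bpe` is illustrative and not part of fairseq or parse_nmt.py.

```python
import re


def remove_bpe(line):
    """Join BPE pieces by dropping the '@@ ' continuation markers."""
    return re.sub(r"(@@ )|(@@ ?$)", "", line)


line = (
    "to these non-@@ engineers , li@@ tt@@ leb@@ its became another material , "
    "electr@@ on@@ ics became just another material ."
)
print(remove_bpe(line))
# to these non-engineers , littlebits became another material , electronics became just another material .
```

The joined words (non-engineers, littlebits, electronics) are the ones that appear as single leaves in the tree before BPE.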

When I then preprocess the data with parse_nmt.py, I get the following tree. Note that most BPE splits are not applied (for example in non-engineers), but others are (electr@@). The resulting vocabulary has 40k entries, which is nowhere near the 10k from my BPE codes.

(ROOT (S (PP (TO to) (NP (DT these) (NNS non-engineers))) (PRN (, ,) (S (NP (NNS littlebits)) (VP (VBD became) (NP (DT another) (NN material)))) (, ,)) (NP (NNS (NNS_bpe electr@@) (NNS_bpe on@@) (NNS_bpe ics))) (VP (VBD became) (NP (RB just) (DT another) (NN material))) (. .)))

The same tree before BPE looks like this:

(ROOT (S (PP (TO to) (NP (DT these) (NNS non-engineers))) (PRN (, ,) (S (NP (NNS littlebits)) (VP (VBD became) (NP (DT another) (NN material)))) (, ,)) (NP (NNS electronics)) (VP (VBD became) (NP (RB just) (DT another) (NN material))) (. .)))
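One way to make the mismatch concrete is to list which leaves of the post-parse_nmt.py tree carry a `_bpe` tag and which are still whole words. This is only a sketch: it assumes every leaf has the form `(TAG token)` as in the trees above, and the regex and the `leaves` helper are illustrative, not taken from parse_nmt.py.

```python
import re

# A leaf is "(TAG token)" where neither part contains spaces or parentheses.
LEAF = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")


def leaves(tree):
    """Return (tag, token) pairs for every leaf of a bracketed tree."""
    return LEAF.findall(tree)


# The tree produced by parse_nmt.py, copied from above.
tree = (
    "(ROOT (S (PP (TO to) (NP (DT these) (NNS non-engineers))) "
    "(PRN (, ,) (S (NP (NNS littlebits)) (VP (VBD became) "
    "(NP (DT another) (NN material)))) (, ,)) "
    "(NP (NNS (NNS_bpe electr@@) (NNS_bpe on@@) (NNS_bpe ics))) "
    "(VP (VBD became) (NP (RB just) (DT another) (NN material))) (. .)))"
)

tagged = leaves(tree)
segmented = [tok for tag, tok in tagged if tag.endswith("_bpe")]
whole = [tok for tag, tok in tagged if not tag.endswith("_bpe")]

print(segmented)  # ['electr@@', 'on@@', 'ics']
print("non-engineers" in whole, "littlebits" in whole)  # True True
```

Words such as non-engineers and littlebits that stay whole would each enter the vocabulary as their own entry, which would be consistent with a vocabulary far larger than the number of BPE merges.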

What I tried

I thought about reapplying BPE, so I ran parse_nmt.py with the --convert_bpe option. This applies BPE to all the tokens that were missed, but it also re-applies BPE to the tokens that were already segmented:

(ROOT (S (PP (TO to) (NP (DT these) (NNS (NNS_bpe non-@@) (NNS_bpe engineers)))) (PRN (, ,) (S (NP (NNS (NNS_bpe li@@) (NNS_bpe tt@@) (NNS_bpe leb@@) (NNS_bpe its))) (VP (VBD became) (NP (DT another) (NN material)))) (, ,)) (NP (NNS (NNS_bpe (NNS_bpe_bpe electr@@) (NNS_bpe_bpe @@@) (NNS_bpe_bpe @)) (NNS_bpe (NNS_bpe_bpe on@@) (NNS_bpe_bpe @@@) (NNS_bpe_bpe @)) (NNS_bpe ics))) (VP (VBD became) (NP (RB just) (DT another) (NN material))) (. .)))

This produces junk for the tokens where BPE has been applied in the previous step. See for example: (NNS_bpe (NNS_bpe_bpe electr@@) (NNS_bpe_bpe @@@) (NNS_bpe_bpe @)) (NNS_bpe (NNS_bpe_bpe on@@) (NNS_bpe_bpe @@@) (NNS_bpe_bpe @)) (NNS_bpe ics)))
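The double application could in principle be avoided by skipping leaves that already carry a `_bpe` tag. The sketch below illustrates that idea on plain strings. It is not how parse_nmt.py or --convert_bpe work internally: `apply_bpe_once`, `toy_segment` and the `TOY_SEGMENTS` table are hypothetical stand-ins, and a real run would call the actual BPE model instead of the lookup table.

```python
import re

# A leaf is "(TAG token)" where neither part contains spaces or parentheses.
LEAF = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")


def apply_bpe_once(tree, segment):
    """Split unsegmented leaves with `segment`, leaving *_bpe leaves untouched."""

    def rewrite(match):
        tag, token = match.group(1), match.group(2)
        if tag.endswith("_bpe"):
            return match.group(0)  # already segmented in an earlier step
        pieces = segment(token)
        if len(pieces) == 1:
            return match.group(0)  # word is a single BPE unit
        children = " ".join(f"({tag}_bpe {piece})" for piece in pieces)
        return f"({tag} {children})"

    return LEAF.sub(rewrite, tree)


# Hypothetical segmentations, copied from the BPE'd test.en line in this thread.
TOY_SEGMENTS = {
    "non-engineers": ["non-@@", "engineers"],
    "littlebits": ["li@@", "tt@@", "leb@@", "its"],
}


def toy_segment(word):
    return TOY_SEGMENTS.get(word, [word])


mixed = (
    "(NP (DT these) (NNS non-engineers)) "
    "(NP (NNS (NNS_bpe electr@@) (NNS_bpe on@@) (NNS_bpe ics)))"
)
fixed = apply_bpe_once(mixed, toy_segment)
print(fixed)
# (NP (DT these) (NNS (NNS_bpe non-@@) (NNS_bpe engineers))) (NP (NNS (NNS_bpe electr@@) (NNS_bpe on@@) (NNS_bpe ics)))
```

Running the function a second time on its own output leaves the tree unchanged, so no `_bpe_bpe` nodes are created.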

Question

How should I preprocess the IWSLT data to get the correct BPE'd tree?

Thanks!
