Hi, I'm struggling to create the data the way you described. I followed the instructions closely, and after preprocessing with fairseq the data looks like this:

A sample line from test.en:

to these non-@@ engineers , li@@ tt@@ leb@@ its became another material , electr@@ on@@ ics became just another material .

When I then process the data with parse_nmt.py, I get the following tree. Note that most BPE splits are lost (for example, non-engineers is a single token again), while others are kept (electr@@). The resulting vocabulary has 40k entries, which is nowhere near the 10k from my BPE codes.

(ROOT (S (PP (TO to) (NP (DT these) (NNS non-engineers))) (PRN (, ,) (S (NP (NNS littlebits)) (VP (VBD became) (NP (DT another) (NN material)))) (, ,)) (NP (NNS (NNS_bpe electr@@) (NNS_bpe on@@) (NNS_bpe ics))) (VP (VBD became) (NP (RB just) (DT another) (NN material))) (. .)))

The same tree before BPE looks like this:

(ROOT (S (PP (TO to) (NP (DT these) (NNS non-engineers))) (PRN (, ,) (S (NP (NNS littlebits)) (VP (VBD became) (NP (DT another) (NN material)))) (, ,)) (NP (NNS electronics)) (VP (VBD became) (NP (RB just) (DT another) (NN material))) (. .)))
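As a sanity check, joining the BPE pieces of the test.en line above gives exactly the leaves of this pre-BPE tree, so the two files do line up word for word. A minimal sketch of that check follows; `undo_bpe` and `tree_leaves` are my own hypothetical helpers (not part of parse_nmt.py or fairseq), and the regex assumes the usual `@@ ` continuation marker:

```python
import re

def undo_bpe(line):
    # Remove the "@@ " continuation markers so split pieces become whole words again.
    return re.sub(r"(@@ )|(@@ ?$)", "", line)

def tree_leaves(tree):
    # A leaf is a "(TAG word)" pair that contains no nested brackets.
    return [m.group(2) for m in re.finditer(r"\(([^\s()]+) ([^\s()]+)\)", tree)]

bpe_line = ("to these non-@@ engineers , li@@ tt@@ leb@@ its became another material , "
            "electr@@ on@@ ics became just another material .")
tree = ("(ROOT (S (PP (TO to) (NP (DT these) (NNS non-engineers))) (PRN (, ,) "
        "(S (NP (NNS littlebits)) (VP (VBD became) (NP (DT another) (NN material)))) (, ,)) "
        "(NP (NNS electronics)) (VP (VBD became) (NP (RB just) (DT another) (NN material))) (. .)))")

# True when the joined BPE line and the tree contain the same word sequence.
words_match = undo_bpe(bpe_line).split() == tree_leaves(tree)
```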

What I tried

I thought about reapplying BPE, so I ran parse_nmt.py with the --convert_bpe option. This applies BPE to all the tokens that were missing it, but it also re-applies BPE to the tokens that were already split:

(ROOT (S (PP (TO to) (NP (DT these) (NNS (NNS_bpe non-@@) (NNS_bpe engineers)))) (PRN (, ,) (S (NP (NNS (NNS_bpe li@@) (NNS_bpe tt@@) (NNS_bpe leb@@) (NNS_bpe its))) (VP (VBD became) (NP (DT another) (NN material)))) (, ,)) (NP (NNS (NNS_bpe (NNS_bpe_bpe electr@@) (NNS_bpe_bpe @@@) (NNS_bpe_bpe @)) (NNS_bpe (NNS_bpe_bpe on@@) (NNS_bpe_bpe @@@) (NNS_bpe_bpe @)) (NNS_bpe ics))) (VP (VBD became) (NP (RB just) (DT another) (NN material))) (. .)))

This produces junk for the tokens where BPE had already been applied in the previous step, because the @@ marker itself gets split again. For example: (NNS_bpe (NNS_bpe_bpe electr@@) (NNS_bpe_bpe @@@) (NNS_bpe_bpe @)) (NNS_bpe (NNS_bpe_bpe on@@) (NNS_bpe_bpe @@@) (NNS_bpe_bpe @)) (NNS_bpe ics)

Question

How should I preprocess the IWSLT data to get the correct BPE'd tree?
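To make the output I expect concrete, here is a minimal sketch of my own (not the repo's method) that takes the segmentation from the already BPE'd text line and copies it onto the leaves of the pre-BPE tree, so every word is split exactly once. The names `group_bpe` and `apply_segmentation` are hypothetical, and the `TAG_bpe` labelling is simply copied from the trees above:

```python
import re

LEAF = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")  # "(TAG word)" with no nested brackets

def group_bpe(line):
    # Collect BPE tokens into one list of pieces per word; "@@" marks a non-final piece.
    words, current = [], []
    for tok in line.split():
        current.append(tok)
        if not tok.endswith("@@"):
            words.append(current)
            current = []
    if current:
        words.append(current)
    return words

def apply_segmentation(tree, bpe_line):
    # Walk the leaves left to right and replace each one with its BPE pieces.
    words = iter(group_bpe(bpe_line))

    def repl(match):
        tag, word = match.group(1), match.group(2)
        pieces = next(words)
        joined = "".join(p[:-2] if p.endswith("@@") else p for p in pieces)
        if joined != word:
            raise ValueError(f"tree leaf {word!r} does not match BPE word {joined!r}")
        if len(pieces) == 1:
            return match.group(0)  # unsplit word: keep the leaf unchanged
        inner = " ".join(f"({tag}_bpe {p})" for p in pieces)
        return f"({tag} {inner})"

    return LEAF.sub(repl, tree)

result = apply_segmentation("(NP (DT these) (NNS non-engineers))", "these non-@@ engineers")
# result == "(NP (DT these) (NNS (NNS_bpe non-@@) (NNS_bpe engineers)))"
```

This assumes the tree and the BPE'd line contain the same words in the same order; if the parser retokenizes anything, the check inside `repl` raises instead of silently misaligning.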

Thanks!

BPE codes #1
