Warning: Careful using a custom tokenizer... #122

PetrochukM · 2020-09-15T02:03:08Z

I tried the spaCy tokenizer, nltk word_tokenize, sacremoses MosesTokenizer, nltk TreebankWordTokenizer, and nltk TweetTokenizer as custom tokenizers.

For the example "inch BBL, unquote, cost $29.95", they all output ['inch', 'BBL', ',', 'unquote', ',', 'cost', '$', '29.95', '.']. Note that "$" and "29.95" end up as separate tokens. This output is incompatible with normalise: given these tokens, it predicts "inch B B L, unquote, cost $twenty nine point nine five.".
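To make the problem easier to spot, here is a minimal sketch of a check that flags token lists where a currency symbol has been split from its amount, plus a toy regex tokenizer that keeps "$29.95" together. Both `splits_currency` and `simple_tokenize` are hypothetical helpers written for illustration; they are not part of normalise or any of the tokenizers above, and I have not verified that normalise handles the resulting tokens correctly.

```python
import re

# Hypothetical helper: currency symbols we want to keep attached to amounts.
CURRENCY_SYMBOLS = {"$", "£", "€"}
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def splits_currency(tokens):
    """Return True if a bare currency symbol is directly followed by a number token."""
    for current, following in zip(tokens, tokens[1:]):
        if current in CURRENCY_SYMBOLS and NUMBER.fullmatch(following):
            return True
    return False


def simple_tokenize(text):
    """Toy tokenizer (an assumption, not normalise's own): keeps amounts like
    '$29.95' as one token, then falls back to words and single punctuation marks."""
    return re.findall(r"[$£€]?\d+(?:\.\d+)?|\w+|[^\w\s]", text)


# Token list reported above for the common tokenizers.
reported = ['inch', 'BBL', ',', 'unquote', ',', 'cost', '$', '29.95', '.']
print(splits_currency(reported))  # the '$' / '29.95' split is detected

tokens = simple_tokenize("inch BBL, unquote, cost $29.95")
print(tokens)
print(splits_currency(tokens))
```

Running a check like this on a custom tokenizer's output before passing it to normalise should at least catch the money case; other token types (dates, abbreviations, and so on) may need similar checks.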

PetrochukM changed the title ~~Careful using a custom tokenizer...~~ Warning: Careful using a custom tokenizer... Sep 15, 2020
