I tried the spaCy tokenizer, NLTK's `word_tokenize`, sacremoses' `MosesTokenizer`, NLTK's `TreebankWordTokenizer`, and NLTK's `TweetTokenizer`.
For the example `"inch BBL, unquote, cost $29.95"`, they all output `['inch', 'BBL', ',', 'unquote', ',', 'cost', '$', '29.95', '.']`. This output is incompatible with `normalise`, which then predicts `"inch B B L, unquote, cost $twenty nine point nine five."`.
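For comparison, here is a minimal sketch of a whitespace-based tokenizer that detaches trailing punctuation but keeps a currency amount such as `$29.95` as one token. `tokenize_keep_currency` is a hypothetical helper written for this illustration; it is not part of `normalise`, spaCy, NLTK, or sacremoses, and I have not verified that `normalise` handles its output any better. The trailing period in the input is an assumption, added because the token lists reported above end with `'.'`.

```python
import re

# Hypothetical helper for illustration only; not part of any library named above.
# Matches a whitespace-delimited chunk that ends in one or more punctuation marks.
_TRAILING = re.compile(r"^(.*?)([,.;:!?]+)$")


def tokenize_keep_currency(text):
    """Split on whitespace and detach trailing punctuation, leaving '$29.95' whole."""
    tokens = []
    for chunk in text.split():
        match = _TRAILING.match(chunk)
        if match and match.group(1):
            tokens.append(match.group(1))
            tokens.extend(match.group(2))  # one token per punctuation character
        else:
            # No trailing punctuation, or the chunk is punctuation only.
            tokens.append(chunk)
    return tokens


# Assumed input: the example from this issue with a trailing period.
tokens = tokenize_keep_currency("inch BBL, unquote, cost $29.95.")
print(tokens)
```

The only difference from the tokenizers listed above is that `$` stays attached to the number, so anything downstream sees the amount as a single money token instead of a symbol followed by a decimal.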
PetrochukM changed the title from "Careful using a custom tokenizer..." to "Warning: Careful using a custom tokenizer..." on Sep 15, 2020