Use twitter-text to extract hashtags, mentions, and URLs #44

jrnold · 2017-08-24T17:42:36Z

Currently the tokenizer has its own regexes for hashtags, mentions, and URLs (and there's a comment about what the best URL pattern is). Twitter maintains a Java library, twitter-text, that can extract these and handles all sorts of weird edge cases. It also has a pretty good regex for matching URLs that aren't preceded by a protocol. Offloading the identification of the Twitter-specific tokens to the Twitter-maintained library would probably improve the identification of those items (or, at the very least, mean the tokenizer makes the same mistakes as Twitter itself).

brendano · 2017-08-27T20:51:03Z

It would be great to see a diff between the tokenization under twokenize's current rules and the tokenization under twitter-text's rules.
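
A rough sketch of how such a diff might be produced, written in Python with only the standard library for brevity. Both tokenizers here are hypothetical stand-ins (`tokenize_current`, `tokenize_candidate`, and the `ENTITY` pattern are made up for illustration and are not twokenize's or twitter-text's actual rules); the part that would carry over is `token_diff`, which reports only the spans where two token sequences disagree.

```python
import difflib
import re

# Hypothetical stand-in for the existing rules: split on whitespace only.
def tokenize_current(text):
    return text.split()

# Hypothetical stand-in for entity-aware rules: URLs, #hashtags and @mentions
# are kept whole, and remaining punctuation is split off.
# This is NOT the pattern used by twokenize or twitter-text.
ENTITY = re.compile(r"https?://\S+|[#@]\w+|\w+|[^\w\s]")

def tokenize_candidate(text):
    return ENTITY.findall(text)

def token_diff(old_tokens, new_tokens):
    """Return (tag, old_span, new_span) for every span where the two differ."""
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    return [
        (tag, old_tokens[i1:i2], new_tokens[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

if __name__ == "__main__":
    tweet = "cc @user: see #nlp!"
    for tag, old, new in token_diff(tokenize_current(tweet), tokenize_candidate(tweet)):
        print(tag, old, "->", new)
```

Running `token_diff` over a corpus of tweets and keeping only the non-empty results would give a list of the tweets where the two rule sets disagree, which is probably the most useful thing to review.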
