Use twitter-text to extract hashtags, mentions, and URLs #44

Open
jrnold opened this issue Aug 24, 2017 · 1 comment
Comments

@jrnold

jrnold commented Aug 24, 2017

Currently the tokenizer has its own regexes for hashtags, mentions, and URLs (and there's a comment about what the best URL pattern is). Twitter maintains a Java library, twitter-text, that can extract these and handles all sorts of weird edge cases. It also has a pretty good regex for matching URLs that aren't preceded by a protocol. Offloading the identification of the Twitter-specific tokens to the Twitter-maintained library would probably improve the identification of those items (or, at the very least, mean it's making the same mistakes as Twitter itself).

@brendano
Owner

It would be great to see a diff of tokenization under twokenize's current rules, versus what it is when using twitter-text's rules.
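
One way such a diff could be produced is sketched below. This is only an illustration of the comparison harness: `diff_tokenizations`, `whitespace_tokenizer`, and `entity_aware_tokenizer` are hypothetical names, and the two tokenizers are simplified stand-ins, not twokenize's actual rules or twitter-text's patterns. In practice the two real tokenizers would be plugged in as the `tok_a` and `tok_b` callables.

```python
import re
from typing import Callable, Iterable, List, Tuple

Tokenizer = Callable[[str], List[str]]


def diff_tokenizations(
    texts: Iterable[str], tok_a: Tokenizer, tok_b: Tokenizer
) -> List[Tuple[str, List[str], List[str]]]:
    """Return (text, tokens_a, tokens_b) for every text the tokenizers disagree on."""
    diffs = []
    for text in texts:
        tokens_a, tokens_b = tok_a(text), tok_b(text)
        if tokens_a != tokens_b:
            diffs.append((text, tokens_a, tokens_b))
    return diffs


# Hypothetical stand-in for the "current rules" side: split on whitespace only.
def whitespace_tokenizer(text: str) -> List[str]:
    return text.split()


# Hypothetical stand-in for the "entity-aware" side. This toy pattern is an
# assumption for illustration and is far simpler than twitter-text's regexes.
ENTITY_RE = re.compile(r"https?://\S+|[#@]\w+|\w+|[^\w\s]")


def entity_aware_tokenizer(text: str) -> List[str]:
    return ENTITY_RE.findall(text)


if __name__ == "__main__":
    sample = ["hello world", "see #nlp, now"]
    for text, a, b in diff_tokenizations(sample, whitespace_tokenizer, entity_aware_tokenizer):
        print(f"{text!r}\n  A: {a}\n  B: {b}")
```

Running this over a large sample of tweets and reading only the disagreements would show which edge cases each rule set handles differently.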
